Beating memcmp
Andersama AUG 4, 2021
A Quick Brief
Under usual conditions this is something no one should really need, want or have to do. memcmp, after all, is one of the most obvious things to optimize in any program, as it's essentially a generalization of one of the most basic instructions, cmp. It's also seriously unlikely we'll outperform memcmp on our own.
Let's first go over a few optimizations that are likely happening when you make use of memcmp. Provided we're using an optimizing compiler, it might notice when memcmp is called with a fixed length.
int result = memcmp(buf_ptr0, buf_ptr1, 4);
Here the compiler might recognize that a 4-byte-wide comparison is, in fact, just a cmp instruction; better yet, it might save on some work by just using sub and sticking the resulting value into the return register.
Surprisingly, this is roughly most of what makes a modern memcmp quick: even with a non-fixed size, a quick and dirty loop going 4 bytes at a time is absurdly fast. The sub sets flags as well, so we can follow it immediately with a conditional jump to exit ASAP.
A bit of an implicit optimization with a fixed-length comparison is that we don't need to loop: provided the comparison length is reasonable, the compiler might remove all loops and just spit out a series of sub instructions or equivalents.
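To make the fixed-length case concrete, here is a minimal sketch of a comparison whose length is a compile-time constant. The name `equal_fixed` and the 4-byte word size are assumptions made for illustration; this is not code from the benchmarks in this post. With the trip count known, an optimizing compiler is free to unroll the loop completely, though whether it actually does is worth checking in the disassembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the post): equality of two N-byte regions,
// with N fixed at compile time so the loop bound is a constant.
template <std::size_t N>
inline bool equal_fixed(const char* a, const char* b) noexcept {
    static_assert(N % sizeof(std::uint32_t) == 0, "N must be a multiple of 4");
    assert(a != nullptr && b != nullptr);
    for (std::size_t i = 0; i < N; i += sizeof(std::uint32_t)) {
        std::uint32_t x, y;
        // memcpy is the portable way to load a word out of a char buffer;
        // compilers usually lower it to a plain load.
        std::memcpy(&x, a + i, sizeof x);
        std::memcpy(&y, b + i, sizeof y);
        if (x != y) return false;  // exit on the first differing word
    }
    return true;
}
```

With `N = 4` this is the single-compare case described earlier, e.g. `equal_fixed<4>(p, q)`.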
Another possibility is that it might make the comparisons branchless... although if the compiler is optimizing heavily enough, it might instead rewrite our memcmp to branch only once, to exit the loop. To get an idea, let's picture what our own memcmp might look like.
constexpr int bytecmp(const char* buf_ptr0, const char* buf_ptr1, size_t bytes) noexcept {
    int ret = 0;
    const size_t loop_count = bytes / sizeof(int);
    const size_t loop_rem = bytes % sizeof(int);
    // loop 4 bytes at a time (assumes the buffers are suitably aligned for int)
    for (size_t i = 0; i < loop_count; i++) {
        int tmp = ((const int*)buf_ptr0)[i] - ((const int*)buf_ptr1)[i];
        if (tmp != 0) {
            return ret = tmp;
        }
    }
    // bump forward past the bytes already compared
    buf_ptr0 += loop_count * sizeof(int);
    buf_ptr1 += loop_count * sizeof(int);
    // loop a byte at a time over the remainder
    for (size_t i = 0; i < loop_rem; i++) {
        int tmp = ((const char*)buf_ptr0)[i] - ((const char*)buf_ptr1)[i];
        if (tmp != 0) {
            return ret = tmp;
        }
    }
    return ret;
}
bool result = (test.size >= test.size) && bytecmp(test.ptr, test.ptr, test.size) == 0;
In a branchy disassembly, the compiler could propagate the == 0 condition into the loops, jumping out as soon as a difference is != 0.
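Two caveats about the bytecmp sketch are worth spelling out: casting a char pointer to int* assumes the buffer is suitably aligned, and the sign of a word-wide subtraction does not necessarily match what memcmp would return, since memcmp compares bytes in order as unsigned char. The variant below is only a sketch under those observations, and `bytecmp_ordered` is a made-up name; it is not the version benchmarked in this post. It loads words through std::memcpy and settles the ordering byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical variant (not the benchmarked code): word-at-a-time scan that
// returns a result with the same sign memcmp would give.
inline int bytecmp_ordered(const char* a, const char* b, std::size_t n) noexcept {
    assert(n == 0 || (a != nullptr && b != nullptr));
    constexpr std::size_t W = sizeof(std::uint32_t);
    std::size_t i = 0;
    // Skip over whole words that are identical.
    for (; i + W <= n; i += W) {
        std::uint32_t x, y;
        std::memcpy(&x, a + i, W);  // alignment-safe load
        std::memcpy(&y, b + i, W);
        if (x != y) break;          // the difference is inside this word
    }
    // Resolve the differing word (or the tail) one byte at a time,
    // comparing as unsigned char like memcmp does.
    for (; i < n; ++i) {
        const unsigned char ca = static_cast<unsigned char>(a[i]);
        const unsigned char cb = static_cast<unsigned char>(b[i]);
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    return 0;
}
```

If only equality matters, the byte-wise tail is unnecessary and the `x != y` test alone is enough.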
Of course, our bytecmp is simplified compared to what you might spot in memcmp's disassembly, but let's see how it does.
//Read "op" as one byte, e.g. 0.12 ns/op is roughly 8 GB/s (not bad).
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.12 | 8,315,064,635.07 | 2.7% | 12.03 | `memcmp`
| 0.11 | 8,936,764,258.15 | 3.2% | 11.83 | `byte cmp`
Huh, wait, what? We already beat memcmp? What happened?
Well, obviously this doesn't provide full context. Here we're testing 17 bytes, meaning we're dealing with two loops: one which is blazing fast, and one which chews through the leftover bytes (at most 3; here just 1). Worse yet, these benchmarks run fairly long (around 12 seconds), which warms the cache. Maybe we wouldn't do so well if we jumped right into the code cold.
So what happens when the cache is cold? Does our simple version still do OK?
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.10 | 9,643,190,854.87 | 0.0% | 0.00 | `memcmp`
| 0.10 | 9,644,749,316.32 | 0.0% | 0.00 | `byte cmp`
Wooo... we're riding the edge. Removing the minimum run length from the benchmarks means each one is benchmarking roughly a single execution, which leaves little wiggle room to warm a cache. Obviously this isn't exactly a reliable test (a cold cache shouldn't do as well as a warm one, after all), but we'll skip over that for now. What about different compilers: is this even portable? So far these results are from clang. What happens if we try MSVC?
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.11 | 9,084,169,261.92 | 4.4% | 12.58 | `memcmp`
| 0.11 | 9,288,711,765.25 | 4.6% | 12.03 | `byte cmp (native)`
| 0.12 | 8,361,730,502.57 | 6.4% | 12.46 | :wavy_dash: `byte cmp (32)` (Unstable with ~553,541,161.0 iters. Increase `minEpochIterations` to e.g. 5535411610)
| 0.11 | 9,351,078,637.75 | 3.8% | 11.94 | `byte cmp (64)`
//no warming the cache
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.10 | 9,645,184,824.90 | 0.0% | 0.00 | `memcmp`
| 0.10 | 9,643,166,666.67 | 0.0% | 0.00 | `byte cmp (native)` //this is equivalent to byte cmp 64
| 0.10 | 9,645,458,515.28 | 0.0% | 0.00 | `byte cmp (32)`
| 0.10 | 9,646,511,627.91 | 0.0% | 0.00 | `byte cmp (64)`
Here I've added some additional tests for 64-bit, and we start to see a bit of variation. In fact, with as quick a run as possible, without hitting the cache hard, we're dancing around memcmp. Also, the results are all within an order of magnitude of each other, so at least we've not shot the pooch at the start.
Here's what our byte_cmp (native) looks like.
constexpr auto byte_cmp(const char* pattern, const char* input, size_t sz) noexcept {
    using register_type = std::conditional_t<sizeof(void*) == 8, uint64_t, uint32_t>;
    using result_type = std::conditional_t<sizeof(void*) == 8, int64_t, int32_t>;
    //div rem can be one op
    const size_t loop_count = sz / (sizeof(register_type));
    const size_t loop_rem = sz % (sizeof(register_type));
    //do a running sub
    result_type ret = 0;
    for (size_t i = 0; i < loop_count; i++) {
        result_type tmp = ((const result_type*)pattern)[i] - ((const result_type*)input)[i];
        if (tmp != 0)
            return ret = tmp;
    }
    pattern += (loop_count * sizeof(register_type));
    input += (loop_count * sizeof(register_type));
    for (size_t i = 0; i < loop_rem; i++) {
        result_type tmp = pattern[i] - input[i];
        if (tmp != 0)
            return ret = tmp;
    }
    return ret;
}
A couple of interesting things to note: our 64-bit version is roughly tied with our 32-bit approach, as well as with memcmp. Still, let's not get carried away; surely this function wouldn't outdo memcmp right off the bat!
And unfortunately, if you benchmark a bit, it won't be long before something like this happens (even with clang):
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.10 | 10,032,428,349.11 | 1.2% | 12.15 | `memcmp`
| 0.12 | 8,271,308,848.90 | 8.8% | 12.21 | :wavy_dash: `byte cmp (native)` (Unstable with ~561,770,576.4 iters. Increase `minEpochIterations` to e.g. 5617705764)
| 0.11 | 8,845,268,164.75 | 6.8% | 12.62 | :wavy_dash: `byte cmp (32)` (Unstable with ~594,977,833.9 iters. Increase `minEpochIterations` to e.g. 5949778339)
| 0.12 | 8,624,424,732.14 | 2.4% | 12.25 | `byte cmp (64)`
But we're certainly close. Well...
Hold up
Why? Because everything so far is a prime example of why we need to be careful about optimizations when benchmarking.
What happened? Both clang and MSVC could see through the test functions I wrote and performed another optimization. Here is what my test looked like:
bench.run("memcmp", [&count]() {
    constexpr str_view test = { "0123456789abcdefA" };
    bool result = (test.size >= test.size) && std::memcmp(test.ptr, test.ptr, test.size) == 0;
    count += result;
});
There's an obvious comparison which is trivially true, but the compiler also works out that memcmp-ing a buffer against itself always returns 0, so the whole expression is always true. That's right: so far all these results are bunk. Why? Because they depend entirely on how well a simple move is optimized (and that can also be optimized away into just an increment). Doh...
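One way to keep the compiler from folding a comparison like this is to make the inputs opaque to the optimizer. The sketch below round-trips the pointers and the length through a volatile object, so the compiler has to treat them as unknown values. `hide` and `opaque_equal` are hypothetical names, this is not how the tests in this post were written, and benchmark libraries often ship their own helpers for the same purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: pass a value through a volatile object. The volatile
// read must really happen, so the optimizer cannot assume what comes back.
template <class T>
inline T hide(T value) noexcept {
    volatile T slot = value;
    return slot;
}

// Equality check whose operands the compiler cannot prove to be identical,
// even when the caller passes the same pointer twice.
inline bool opaque_equal(const char* a, const char* b, std::size_t n) noexcept {
    assert(a != nullptr && b != nullptr);
    const char* pa = hide(a);
    const char* pb = hide(b);
    return std::memcmp(pa, pb, hide(n)) == 0;
}
```

The volatile round trip costs a store and a load per value, so it belongs in the setup around the measured call rather than in the function being measured.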
OK, let's... try this again. This time we'll rewrite our test to use two different std::strings as memory buffers, and let's randomize their contents (once, with the same inputs for all tests*).
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.26 | 3,786,876,385.21 | 5.2% | 12.13 | :wavy_dash: `memcmp` (Unstable with ~247,248,813.5 iters. Increase `minEpochIterations` to e.g. 2472488135)
| 0.15 | 6,867,877,477.58 | 4.3% | 11.93 | `byte cmp`
OK, lesson learned: we need to be careful how we write our tests.
Wait what? We beat memcmp again? Did we get lucky?
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.25 | 3,956,294,752.61 | 8.5% | 11.71 | :wavy_dash: `memcmp` (Unstable with ~248,258,903.5 iters. Increase `minEpochIterations` to e.g. 2482589035)
| 0.13 | 7,793,473,898.92 | 0.6% | 11.86 | `byte cmp`
Seems like the answer is that we beat it handily, and a factor of almost two is nothing to scoff at. But hey, at least we've learned something: if our compilers can optimize our memcmps away, they can very easily double our throughput. So where do we go from here? We beat memcmp... can we do better?
(I'm going to give away here that past this point, this is about the best I've done as of writing this)
There are other things we can try. For example, the code generated from the function above is branchy; we could attempt to force the compiler to generate branchless code and see what happens.
constexpr auto byte_eq(const char* pattern, const char* input, size_t sz) noexcept {
    using register_type = std::conditional_t<sizeof(void*) == 8, uint64_t, uint32_t>;
    //div rem can be one op
    const size_t loop_count = sz / (sizeof(register_type));
    const size_t loop_rem = sz % (sizeof(register_type));
    //this performs oddly
    register_type ret = 0;
    for (std::size_t i = 0; i < loop_count; i++) {
        ret |= ((const register_type*)pattern)[i] ^ ((const register_type*)input)[i];
    }
    pattern += (loop_count * sizeof(register_type));
    input += (loop_count * sizeof(register_type));
    for (std::size_t i = 0; i < loop_rem; i++) {
        ret |= ((const char*)pattern)[i] ^ ((const char*)input)[i];
    }
    return ret == 0;
}
Here we're accumulating the differences into a single value and only checking it at the end... and we're somewhat forgoing how memcmp functions, but that should be OK. After all, I'm not necessarily interested in which buffer looks greater or lesser than the other; I really only care whether they're equal or not.
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.25 | 3,956,294,752.61 | 8.5% | 11.71 | :wavy_dash: `memcmp` (Unstable with ~248,258,903.5 iters. Increase `minEpochIterations` to e.g. 2482589035)
| 0.13 | 7,793,473,898.92 | 0.6% | 11.86 | `byte cmp`
| 0.45 | 2,239,708,217.48 | 3.5% | 11.52 | `byte eq`
I secretly ran this test with the other bunch... and... it's not looking good. In fact we've managed to seriously hamper our throughput, so badly that we fell way below memcmp. Bummer. What if we try writing it a different way? Maybe other instructions will do better?
constexpr auto byte_eq_2(const char* pattern, const char* input, size_t sz) noexcept {
    using register_type = std::conditional_t<sizeof(void*) == 8, uint64_t, uint32_t>;
    //div rem can be one op
    const size_t loop_count = sz / (sizeof(register_type));
    const size_t loop_rem = sz % (sizeof(register_type));
    //this performs oddly
    register_type ret = 0;
    for (std::size_t i = 0; i < loop_count; i++) {
        ret |= ((const register_type*)pattern)[i] != ((const register_type*)input)[i];
    }
    pattern += (loop_count * sizeof(register_type));
    input += (loop_count * sizeof(register_type));
    for (std::size_t i = 0; i < loop_rem; i++) {
        ret |= ((const char*)pattern)[i] != ((const char*)input)[i];
    }
    return ret == 0;
}
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.27 | 3,663,358,773.86 | 3.6% | 12.16 | `memcmp`
| 0.13 | 7,809,970,727.60 | 0.5% | 11.81 | `byte cmp`
| 0.20 | 5,106,907,826.44 | 5.1% | 11.87 | :wavy_dash: `byte eq` (Unstable with ~329,177,778.0 iters. Increase `minEpochIterations` to e.g. 3291777780)
| 0.25 | 3,999,846,415.60 | 5.4% | 11.52 | :wavy_dash: `byte eq (!=)` (Unstable with ~246,409,890.3 iters. Increase `minEpochIterations` to e.g. 2464098903)
And here we can see again that the swings in performance between runs are a bit of a pain to work with. Here the branchless versions manage to beat memcmp... but we're nowhere near our best: memcmp reliably sits around 3.6 GB/s and our not-too-shabby byte cmp stays at almost 8 GB/s.
But again, this is a little odd. All things considered, we shouldn't beat memcmp so easily. So here's the final kicker: our tests are still not great. Randomized inputs are fine and all, but two decently random buffers are unlikely to match even in their first bytes, meaning what this mostly shows is the speedup from failing the first comparison. Our simple loops net us a big win when branching and exiting the loop almost immediately; any extra work memcmp might be doing at the start is where we're winning out. This is also likely why our branchless approaches were so penalized: on every execution they processed the whole buffer.
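To separate the cost of an early exit from the cost of a full scan, the test inputs need a controllable mismatch position. The sketch below builds two buffers that are identical up to a chosen index; `make_pair_with_mismatch` and `first_mismatch` are hypothetical names, the seed is arbitrary, and none of this is the setup that produced the numbers in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <random>
#include <string>
#include <utility>

// Hypothetical helper: two buffers of `size` random bytes that are identical
// before `mismatch_at` and differ at that index. Passing mismatch_at >= size
// yields two fully equal buffers.
inline std::pair<std::string, std::string>
make_pair_with_mismatch(std::size_t size, std::size_t mismatch_at, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::string a(size, '\0');
    for (char& c : a) c = static_cast<char>(rng() & 0xFF);
    std::string b = a;  // start from an exact copy
    if (mismatch_at < size)
        b[mismatch_at] = static_cast<char>(b[mismatch_at] ^ 0x01);  // flip one bit
    return {a, b};
}

// Index of the first differing byte, or the common length if there is none.
inline std::size_t first_mismatch(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t i = 0;
    while (i < a.size() && a[i] == b[i]) ++i;
    return i;
}
```

Benchmarking with `mismatch_at` set to 0, to `size - 1` and to `size` would show how each candidate behaves on an immediate exit, a late exit and a full match.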
Let's just eyeball this sizable list of tests to get a sense of what just happened.
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.25 | 4,046,330,384.27 | 6.4% | 11.47 | :wavy_dash: `memcmp"24"` (Unstable with ~244,920,718.2 iters. Increase `minEpochIterations` to e.g. 2449207182)
| 0.14 | 7,370,993,973.96 | 3.9% | 11.97 | `byte cmp`
| 0.20 | 4,964,076,853.94 | 0.3% | 12.29 | `byte eq`
| 0.28 | 3,620,734,067.55 | 2.7% | 11.47 | `byte eq (!=)`
| 0.24 | 4,138,991,726.17 | 7.9% | 11.49 | :wavy_dash: `memcmp"40"` (Unstable with ~254,644,354.6 iters. Increase `minEpochIterations` to e.g. 2546443546)
| 0.13 | 7,426,310,813.53 | 5.5% | 12.30 | :wavy_dash: `byte cmp` (Unstable with ~478,095,200.5 iters. Increase `minEpochIterations` to e.g. 4780952005)
| 0.26 | 3,915,586,431.19 | 3.2% | 12.20 | `byte eq`
| 0.34 | 2,899,997,156.72 | 3.2% | 12.47 | `byte eq (!=)`
| 0.23 | 4,441,797,824.83 | 4.4% | 12.18 | `memcmp"56"`
| 0.14 | 7,300,000,327.59 | 4.2% | 12.40 | `byte cmp`
| 0.33 | 3,071,213,666.83 | 8.5% | 12.78 | :wavy_dash: `byte eq` (Unstable with ~204,282,824.8 iters. Increase `minEpochIterations` to e.g. 2042828248)
| 0.38 | 2,664,940,472.17 | 0.9% | 12.72 | `byte eq (!=)`
| 0.22 | 4,453,806,337.39 | 4.3% | 12.70 | `memcmp"72"`
| 0.13 | 7,801,686,187.23 | 0.7% | 12.35 | `byte cmp`
| 0.33 | 3,013,588,401.91 | 8.5% | 12.69 | :wavy_dash: `byte eq` (Unstable with ~200,045,549.0 iters. Increase `minEpochIterations` to e.g. 2000455490)
| 0.45 | 2,244,288,702.58 | 6.1% | 11.23 | :wavy_dash: `byte eq (!=)` (Unstable with ~132,189,052.8 iters. Increase `minEpochIterations` to e.g. 1321890528)
| 0.23 | 4,434,297,254.62 | 4.8% | 12.49 | `memcmp"88"`
| 0.14 | 6,998,100,425.38 | 8.6% | 11.88 | :wavy_dash: `byte cmp` (Unstable with ~443,459,830.4 iters. Increase `minEpochIterations` to e.g. 4434598304)
| 0.42 | 2,377,513,275.62 | 6.6% | 11.79 | :wavy_dash: `byte eq` (Unstable with ~147,657,586.1 iters. Increase `minEpochIterations` to e.g. 1476575861)
| 0.55 | 1,811,053,241.45 | 7.8% | 12.11 | :wavy_dash: `byte eq (!=)` (Unstable with ~116,360,232.7 iters. Increase `minEpochIterations` to e.g. 1163602327)
| 0.26 | 3,791,575,862.39 | 11.2% | 12.38 | :wavy_dash: `memcmp"104"` (Unstable with ~250,397,854.8 iters. Increase `minEpochIterations` to e.g. 2503978548)
| 0.14 | 7,195,529,816.80 | 2.2% | 11.98 | `byte cmp`
| 0.41 | 2,449,395,677.01 | 5.1% | 11.39 | :wavy_dash: `byte eq` (Unstable with ~147,096,487.9 iters. Increase `minEpochIterations` to e.g. 1470964879)
| 0.56 | 1,779,211,349.08 | 2.8% | 11.74 | `byte eq (!=)`
| 0.23 | 4,328,351,481.51 | 3.5% | 11.99 | `memcmp"120"`
| 0.14 | 7,156,542,684.07 | 6.3% | 12.22 | :wavy_dash: `byte cmp` (Unstable with ~471,514,275.5 iters. Increase `minEpochIterations` to e.g. 4715142755)
| 0.45 | 2,245,907,884.45 | 7.3% | 11.98 | :wavy_dash: `byte eq` (Unstable with ~144,448,706.2 iters. Increase `minEpochIterations` to e.g. 1444487062)
| 0.70 | 1,422,933,261.25 | 2.4% | 11.69 | `byte eq (!=)`
| 0.24 | 4,242,431,038.35 | 4.2% | 11.97 | `memcmp"136"`
| 0.14 | 7,258,096,274.10 | 8.1% | 12.51 | :wavy_dash: `byte cmp` (Unstable with ~483,247,522.3 iters. Increase `minEpochIterations` to e.g. 4832475223)
| 0.43 | 2,306,867,257.90 | 7.9% | 11.76 | :wavy_dash: `byte eq` (Unstable with ~144,178,502.3 iters. Increase `minEpochIterations` to e.g. 1441785023)
| 0.65 | 1,543,830,036.28 | 0.5% | 12.02 | `byte eq (!=)`
| 0.23 | 4,354,335,948.95 | 6.1% | 12.32 | :wavy_dash: `memcmp"152"` (Unstable with ~276,752,129.6 iters. Increase `minEpochIterations` to e.g. 2767521296)
| 0.14 | 7,299,195,315.64 | 4.4% | 12.06 | `byte cmp`
| 0.49 | 2,022,228,370.61 | 4.0% | 12.09 | `byte eq`
| 0.75 | 1,330,474,281.86 | 3.1% | 12.13 | `byte eq (!=)`
| 0.25 | 4,043,184,511.10 | 10.7% | 10.77 | :wavy_dash: `memcmp"168"` (Unstable with ~224,830,238.6 iters. Increase `minEpochIterations` to e.g. 2248302386)
| 0.14 | 6,995,247,119.14 | 1.2% | 12.00 | `byte cmp`
| 0.48 | 2,071,367,408.95 | 1.3% | 12.16 | `byte eq`
| 0.81 | 1,227,209,026.88 | 6.2% | 11.69 | :wavy_dash: `byte eq (!=)` (Unstable with ~76,139,780.6 iters. Increase `minEpochIterations` to e.g. 761397806)
| 0.22 | 4,577,701,820.55 | 1.5% | 12.03 | `memcmp"184"`
| 0.14 | 7,386,744,343.56 | 5.8% | 12.09 | :wavy_dash: `byte cmp` (Unstable with ~473,635,034.2 iters. Increase `minEpochIterations` to e.g. 4736350342)
| 0.49 | 2,029,253,513.08 | 0.7% | 11.41 | `byte eq`
| 0.88 | 1,135,298,854.61 | 6.4% | 12.40 | :wavy_dash: `byte eq (!=)` (Unstable with ~74,340,130.6 iters. Increase `minEpochIterations` to e.g. 743401306)
| 0.27 | 3,753,485,309.57 | 7.1% | 11.91 | :wavy_dash: `memcmp"200"` (Unstable with ~244,696,870.5 iters. Increase `minEpochIterations` to e.g. 2446968705)
| 0.14 | 7,111,918,970.65 | 7.4% | 12.13 | :wavy_dash: `byte cmp` (Unstable with ~463,766,533.4 iters. Increase `minEpochIterations` to e.g. 4637665334)
| 0.56 | 1,779,636,042.94 | 4.5% | 12.08 | `byte eq`
| 1.03 | 967,118,497.49 | 5.5% | 12.13 | :wavy_dash: `byte eq (!=)` (Unstable with ~61,621,434.0 iters. Increase `minEpochIterations` to e.g. 616214340)
| 0.25 | 4,007,753,865.56 | 3.3% | 12.24 | `memcmp"216"`
| 0.13 | 7,475,094,450.56 | 3.4% | 12.41 | `byte cmp`
| 0.59 | 1,705,190,228.66 | 2.1% | 12.12 | `byte eq`
| 1.00 | 1,002,304,737.82 | 1.7% | 11.89 | `byte eq (!=)`
| 0.24 | 4,114,999,867.43 | 2.0% | 12.48 | `memcmp"232"`
| 0.16 | 6,397,097,177.28 | 3.6% | 12.42 | `byte cmp`
| 0.65 | 1,538,617,584.62 | 4.4% | 11.68 | `byte eq`
| 1.19 | 839,435,609.64 | 8.4% | 11.74 | :wavy_dash: `byte eq (!=)` (Unstable with ~52,500,696.7 iters. Increase `minEpochIterations` to e.g. 525006967)
| 0.25 | 3,928,233,080.56 | 9.9% | 11.63 | :wavy_dash: `memcmp"248"` (Unstable with ~249,480,328.9 iters. Increase `minEpochIterations` to e.g. 2494803289)
| 0.14 | 7,049,151,617.57 | 8.8% | 12.22 | :wavy_dash: `byte cmp` (Unstable with ~460,846,295.3 iters. Increase `minEpochIterations` to e.g. 4608462953)
| 0.64 | 1,570,356,494.87 | 4.2% | 11.84 | `byte eq`
| 1.19 | 841,909,520.04 | 4.3% | 12.31 | `byte eq (!=)`
| 0.23 | 4,391,795,824.70 | 5.7% | 12.15 | :wavy_dash: `memcmp"264"` (Unstable with ~277,373,492.6 iters. Increase `minEpochIterations` to e.g. 2773734926)
| 0.14 | 7,132,664,997.15 | 5.8% | 11.77 | :wavy_dash: `byte cmp` (Unstable with ~453,370,859.0 iters. Increase `minEpochIterations` to e.g. 4533708590)
| 0.61 | 1,649,169,354.38 | 1.0% | 12.00 | `byte eq`
| 1.14 | 874,419,722.86 | 3.3% | 11.77 | `byte eq (!=)`
| 0.22 | 4,551,104,583.69 | 2.1% | 12.14 | `memcmp"280"`
| 0.13 | 7,411,584,615.27 | 3.9% | 12.48 | `byte cmp`
| 0.70 | 1,420,122,705.46 | 6.3% | 12.17 | :wavy_dash: `byte eq` (Unstable with ~89,987,804.7 iters. Increase `minEpochIterations` to e.g. 899878047)
| 1.20 | 836,146,426.99 | 1.9% | 12.22 | `byte eq (!=)`
| 0.23 | 4,402,176,957.85 | 4.9% | 12.21 | `memcmp"296"`
| 0.14 | 7,298,125,614.00 | 7.0% | 11.60 | :wavy_dash: `byte cmp` (Unstable with ~453,302,475.6 iters. Increase `minEpochIterations` to e.g. 4533024756)
| 0.67 | 1,483,098,350.49 | 5.0% | 11.61 | :wavy_dash: `byte eq` (Unstable with ~89,506,321.6 iters. Increase `minEpochIterations` to e.g. 895063216)
| 1.31 | 763,031,130.62 | 3.8% | 11.74 | `byte eq (!=)`
| 0.22 | 4,579,775,904.33 | 1.5% | 11.98 | `memcmp"312"`
| 0.15 | 6,884,728,908.56 | 3.4% | 11.86 | `byte cmp`
| 0.75 | 1,328,266,926.90 | 6.7% | 12.08 | :wavy_dash: `byte eq` (Unstable with ~83,978,695.8 iters. Increase `minEpochIterations` to e.g. 839786958)
| 1.29 | 774,966,326.89 | 0.1% | 11.00 | `byte eq (!=)`
| 0.24 | 4,125,147,798.61 | 11.0% | 12.28 | :wavy_dash: `memcmp"328"` (Unstable with ~263,159,055.1 iters. Increase `minEpochIterations` to e.g. 2631590551)
| 0.13 | 7,454,700,249.36 | 5.2% | 12.15 | :wavy_dash: `byte cmp` (Unstable with ~478,366,777.3 iters. Increase `minEpochIterations` to e.g. 4783667773)
| 0.72 | 1,390,490,194.11 | 2.7% | 11.66 | `byte eq`
| 1.35 | 740,840,291.34 | 0.8% | 11.46 | `byte eq (!=)`
| 0.22 | 4,629,698,928.41 | 0.4% | 11.72 | `memcmp"344"`
| 0.14 | 7,389,344,903.45 | 6.2% | 11.81 | :wavy_dash: `byte cmp` (Unstable with ~467,311,455.5 iters. Increase `minEpochIterations` to e.g. 4673114555)
| 0.87 | 1,147,144,959.30 | 4.0% | 12.27 | `byte eq`
| 1.64 | 610,233,629.33 | 9.6% | 11.87 | :wavy_dash: `byte eq (!=)` (Unstable with ~39,131,483.7 iters. Increase `minEpochIterations` to e.g. 391314837)
| 0.24 | 4,139,980,286.72 | 12.0% | 11.14 | :wavy_dash: `memcmp"360"` (Unstable with ~245,766,829.1 iters. Increase `minEpochIterations` to e.g. 2457668291)
| 0.13 | 7,585,315,280.58 | 3.5% | 11.83 | `byte cmp`
| 0.74 | 1,347,729,294.69 | 1.0% | 12.35 | `byte eq`
| 1.55 | 643,819,171.62 | 4.5% | 11.94 | `byte eq (!=)`
| 0.26 | 3,825,360,591.05 | 10.9% | 12.29 | :wavy_dash: `memcmp"376"` (Unstable with ~243,963,604.3 iters. Increase `minEpochIterations` to e.g. 2439636043)
| 0.14 | 7,029,156,331.60 | 2.1% | 11.86 | `byte cmp`
| 0.83 | 1,202,683,485.08 | 4.1% | 11.56 | `byte eq`
| 1.77 | 563,910,950.70 | 2.4% | 11.78 | `byte eq (!=)`
| 0.24 | 4,187,430,075.91 | 4.3% | 11.76 | `memcmp"392"`
| 0.14 | 7,048,431,149.00 | 5.7% | 11.89 | :wavy_dash: `byte cmp` (Unstable with ~447,293,957.5 iters. Increase `minEpochIterations` to e.g. 4472939575)
| 0.85 | 1,178,385,267.06 | 5.8% | 11.73 | :wavy_dash: `byte eq` (Unstable with ~72,305,285.3 iters. Increase `minEpochIterations` to e.g. 723052853)
| 1.78 | 562,487,022.40 | 3.7% | 12.09 | `byte eq (!=)`
| 0.23 | 4,423,814,857.26 | 4.9% | 12.64 | `memcmp"408"`
| 0.15 | 6,796,746,341.68 | 7.1% | 12.46 | :wavy_dash: `byte cmp` (Unstable with ~459,443,067.8 iters. Increase `minEpochIterations` to e.g. 4594430678)
| 0.88 | 1,137,542,046.09 | 3.2% | 12.44 | `byte eq`
| 1.72 | 582,142,163.61 | 4.9% | 12.45 | `byte eq (!=)`
| 0.22 | 4,620,077,060.55 | 0.5% | 11.85 | `memcmp"424"`
| 0.14 | 7,273,818,979.73 | 4.6% | 11.82 | `byte cmp`
| 0.91 | 1,098,893,548.43 | 6.7% | 12.24 | :wavy_dash: `byte eq` (Unstable with ~70,276,209.8 iters. Increase `minEpochIterations` to e.g. 702762098)
| 1.85 | 540,086,820.81 | 6.7% | 11.99 | :wavy_dash: `byte eq (!=)` (Unstable with ~34,317,256.5 iters. Increase `minEpochIterations` to e.g. 343172565)
| 0.23 | 4,400,356,328.49 | 5.0% | 11.75 | `memcmp"440"`
| 0.13 | 7,480,217,020.50 | 4.9% | 11.84 | `byte cmp`
| 1.01 | 987,846,464.49 | 2.5% | 11.52 | `byte eq`
| 1.87 | 535,387,001.11 | 5.2% | 12.18 | :wavy_dash: `byte eq (!=)` (Unstable with ~34,468,451.4 iters. Increase `minEpochIterations` to e.g. 344684514)
| 0.22 | 4,624,991,236.53 | 0.5% | 11.77 | `memcmp"456"`
| 0.13 | 7,443,278,074.27 | 5.6% | 11.82 | :wavy_dash: `byte cmp` (Unstable with ~459,074,682.2 iters. Increase `minEpochIterations` to e.g. 4590746822)
| 0.90 | 1,108,828,768.25 | 0.3% | 12.18 | `byte eq`
| 1.97 | 508,894,129.36 | 5.7% | 12.30 | :wavy_dash: `byte eq (!=)` (Unstable with ~33,181,440.8 iters. Increase `minEpochIterations` to e.g. 331814408)
| 0.23 | 4,302,957,380.98 | 2.7% | 11.95 | `memcmp"472"`
| 0.13 | 7,654,405,740.71 | 2.6% | 11.62 | `byte cmp`
| 1.04 | 964,699,160.93 | 6.2% | 12.26 | :wavy_dash: `byte eq` (Unstable with ~64,128,333.8 iters. Increase `minEpochIterations` to e.g. 641283338)
| 1.88 | 530,631,433.58 | 0.9% | 11.81 | `byte eq (!=)`
| 0.23 | 4,435,296,343.41 | 4.2% | 12.59 | `memcmp"488"`
| 0.14 | 7,161,806,161.67 | 9.3% | 12.39 | :wavy_dash: `byte cmp` (Unstable with ~466,778,516.9 iters. Increase `minEpochIterations` to e.g. 4667785169)
| 1.00 | 996,279,746.44 | 2.8% | 12.01 | `byte eq`
| 2.11 | 474,008,311.18 | 10.0% | 11.91 | :wavy_dash: `byte eq (!=)` (Unstable with ~29,330,741.5 iters. Increase `minEpochIterations` to e.g. 293307415)
| 0.24 | 4,118,445,582.33 | 7.6% | 11.66 | :wavy_dash: `memcmp"504"` (Unstable with ~258,101,502.7 iters. Increase `minEpochIterations` to e.g. 2581015027)
| 0.14 | 6,953,187,817.13 | 5.0% | 12.47 | `byte cmp`
| 1.09 | 918,124,363.08 | 3.6% | 11.87 | `byte eq`
| 2.12 | 472,328,360.37 | 5.5% | 11.22 | :wavy_dash: `byte eq (!=)` (Unstable with ~27,800,937.2 iters. Increase `minEpochIterations` to e.g. 278009372)
| 0.23 | 4,347,998,784.47 | 5.4% | 11.88 | :wavy_dash: `memcmp"520"` (Unstable with ~270,943,844.1 iters. Increase `minEpochIterations` to e.g. 2709438441)
| 0.14 | 7,214,847,639.10 | 5.4% | 12.00 | :wavy_dash: `byte cmp` (Unstable with ~459,762,683.5 iters. Increase `minEpochIterations` to e.g. 4597626835)
| 1.10 | 905,705,381.33 | 5.2% | 11.39 | :wavy_dash: `byte eq` (Unstable with ~55,405,095.4 iters. Increase `minEpochIterations` to e.g. 554050954)
| 2.23 | 448,642,100.62 | 4.9% | 12.30 | `byte eq (!=)`
| 0.23 | 4,339,987,265.48 | 4.7% | 11.69 | `memcmp"536"`
| 0.13 | 7,817,276,254.68 | 0.5% | 12.17 | `byte cmp`
| 1.10 | 908,291,748.11 | 4.6% | 12.17 | `byte eq`
| 2.11 | 472,993,269.78 | 0.7% | 11.97 | `byte eq (!=)`
| 0.22 | 4,636,829,190.89 | 0.2% | 12.34 | `memcmp"552"`
| 0.13 | 7,848,231,547.27 | 0.1% | 12.13 | `byte cmp`
| 1.09 | 915,351,085.72 | 5.7% | 12.19 | :wavy_dash: `byte eq` (Unstable with ~59,383,320.2 iters. Increase `minEpochIterations` to e.g. 593833202)
| 2.17 | 461,876,320.80 | 0.7% | 11.39 | `byte eq (!=)`
| 0.24 | 4,183,279,621.11 | 2.6% | 12.10 | `memcmp"568"`
| 0.13 | 7,581,436,395.95 | 3.5% | 11.64 | `byte cmp`
| 1.12 | 890,931,579.89 | 3.6% | 11.88 | `byte eq`
| 2.31 | 432,907,870.89 | 3.0% | 12.14 | `byte eq (!=)`
| 0.23 | 4,442,028,753.23 | 4.6% | 12.28 | `memcmp"584"`
| 0.15 | 6,789,354,910.06 | 5.9% | 11.75 | :wavy_dash: `byte cmp` (Unstable with ~424,776,045.1 iters. Increase `minEpochIterations` to e.g. 4247760451)
| 1.18 | 846,486,114.02 | 7.5% | 11.18 | :wavy_dash: `byte eq` (Unstable with ~49,743,870.2 iters. Increase `minEpochIterations` to e.g. 497438702)
| 2.47 | 405,286,243.58 | 4.0% | 11.71 | `byte eq (!=)`
| 0.22 | 4,552,086,663.43 | 2.0% | 12.21 | `memcmp"600"`
| 0.13 | 7,585,928,489.78 | 3.5% | 12.09 | `byte cmp`
| 1.22 | 821,053,334.81 | 3.6% | 11.65 | `byte eq`
| 2.45 | 408,994,970.13 | 4.5% | 12.24 | `byte eq (!=)`
| 0.33 | 3,049,534,276.59 | 23.4% | 12.76 | :wavy_dash: `memcmp"616"` (Unstable with ~190,739,190.6 iters. Increase `minEpochIterations` to e.g. 1907391906)
| 0.13 | 7,672,331,699.24 | 2.0% | 12.03 | `byte cmp`
| 1.20 | 832,349,811.63 | 3.0% | 12.41 | `byte eq`
| 2.53 | 395,621,603.00 | 2.7% | 12.12 | `byte eq (!=)`
| 0.27 | 3,726,093,849.65 | 10.5% | 11.41 | :wavy_dash: `memcmp"632"` (Unstable with ~228,214,830.5 iters. Increase `minEpochIterations` to e.g. 2282148305)
| 0.13 | 7,511,543,707.18 | 4.1% | 12.20 | `byte cmp`
| 1.20 | 834,428,198.69 | 0.9% | 12.24 | `byte eq`
| 2.60 | 384,672,840.50 | 5.1% | 12.59 | :wavy_dash: `byte eq (!=)` (Unstable with ~24,569,951.5 iters. Increase `minEpochIterations` to e.g. 245699515)
| 0.33 | 3,060,013,138.15 | 26.2% | 9.68 | :wavy_dash: `memcmp"648"` (Unstable with ~154,685,863.2 iters. Increase `minEpochIterations` to e.g. 1546858632)
| 0.15 | 6,817,379,385.64 | 6.2% | 12.25 | :wavy_dash: `byte cmp` (Unstable with ~448,016,286.6 iters. Increase `minEpochIterations` to e.g. 4480162866)
| 1.35 | 742,141,647.10 | 10.6% | 11.27 | :wavy_dash: `byte eq` (Unstable with ~44,092,923.7 iters. Increase `minEpochIterations` to e.g. 440929237)
| 2.54 | 394,087,068.53 | 1.4% | 12.48 | `byte eq (!=)`
| 0.39 | 2,537,279,603.82 | 16.5% | 13.74 | :wavy_dash: `memcmp"664"` (Unstable with ~193,561,337.5 iters. Increase `minEpochIterations` to e.g. 1935613375)
| 0.15 | 6,728,508,911.49 | 6.2% | 12.38 | :wavy_dash: `byte cmp` (Unstable with ~445,611,629.5 iters. Increase `minEpochIterations` to e.g. 4456116295)
| 1.39 | 716,892,087.12 | 9.8% | 12.30 | :wavy_dash: `byte eq` (Unstable with ~47,444,723.9 iters. Increase `minEpochIterations` to e.g. 474447239)
| 2.74 | 365,335,541.19 | 5.0% | 11.78 | :wavy_dash: `byte eq (!=)` (Unstable with ~22,773,118.9 iters. Increase `minEpochIterations` to e.g. 227731189)
| 0.46 | 2,157,196,325.88 | 5.4% | 11.22 | :wavy_dash: `memcmp"680"` (Unstable with ~141,538,555.4 iters. Increase `minEpochIterations` to e.g. 1415385554)
| 0.14 | 7,034,047,709.05 | 4.6% | 11.66 | `byte cmp`
| 1.36 | 737,163,607.02 | 6.1% | 11.52 | :wavy_dash: `byte eq` (Unstable with ~45,142,914.9 iters. Increase `minEpochIterations` to e.g. 451429149)
| 2.84 | 352,273,141.68 | 7.0% | 12.70 | :wavy_dash: `byte eq (!=)` (Unstable with ~23,711,185.9 iters. Increase `minEpochIterations` to e.g. 237111859)
| 0.32 | 3,109,907,384.50 | 16.7% | 11.94 | :wavy_dash: `memcmp"696"` (Unstable with ~191,531,403.5 iters. Increase `minEpochIterations` to e.g. 1915314035)
| 0.13 | 7,547,020,369.05 | 3.7% | 11.62 | `byte cmp`
| 1.35 | 738,233,621.45 | 4.0% | 11.53 | `byte eq`
| 2.75 | 363,731,626.96 | 2.4% | 12.31 | `byte eq (!=)`
| 0.33 | 3,057,915,837.95 | 25.3% | 13.54 | :wavy_dash: `memcmp"712"` (Unstable with ~207,569,307.5 iters. Increase `minEpochIterations` to e.g. 2075693075)
| 0.13 | 7,729,408,522.85 | 1.3% | 11.85 | `byte cmp`
| 1.37 | 730,954,014.14 | 3.6% | 11.61 | `byte eq`
| 2.79 | 358,244,907.23 | 2.0% | 11.26 | `byte eq (!=)`
| 0.38 | 2,607,202,520.35 | 2.1% | 12.75 | `memcmp"728"`
| 0.13 | 7,442,046,714.66 | 5.1% | 12.17 | :wavy_dash: `byte cmp` (Unstable with ~469,674,659.5 iters. Increase `minEpochIterations` to e.g. 4696746595)
| 1.45 | 687,408,634.73 | 5.4% | 11.87 | :wavy_dash: `byte eq` (Unstable with ~42,818,742.3 iters. Increase `minEpochIterations` to e.g. 428187423)
| 2.91 | 343,535,180.88 | 3.5% | 12.66 | `byte eq (!=)`
| 0.28 | 3,603,976,550.37 | 11.3% | 13.94 | :wavy_dash: `memcmp"744"` (Unstable with ~241,261,058.8 iters. Increase `minEpochIterations` to e.g. 2412610588)
| 0.14 | 7,310,303,915.73 | 5.6% | 11.96 | :wavy_dash: `byte cmp` (Unstable with ~463,990,951.1 iters. Increase `minEpochIterations` to e.g. 4639909511)
| 1.49 | 672,563,121.70 | 4.8% | 12.05 | `byte eq`
| 3.24 | 308,572,992.58 | 5.5% | 12.15 | :wavy_dash: `byte eq (!=)` (Unstable with ~19,798,529.4 iters. Increase `minEpochIterations` to e.g. 197985294)
| 0.28 | 3,511,078,618.71 | 13.2% | 13.02 | :wavy_dash: `memcmp"760"` (Unstable with ~210,586,639.7 iters. Increase `minEpochIterations` to e.g. 2105866397)
| 0.13 | 7,761,625,212.88 | 0.7% | 12.10 | `byte cmp`
| 1.42 | 706,399,092.93 | 2.1% | 12.46 | `byte eq`
| 3.28 | 304,666,753.36 | 4.6% | 11.72 | `byte eq (!=)`
| 0.31 | 3,276,812,074.41 | 18.2% | 13.15 | :wavy_dash: `memcmp"776"` (Unstable with ~209,135,203.2 iters. Increase `minEpochIterations` to e.g. 2091352032)
| 0.14 | 7,344,571,108.44 | 5.7% | 11.37 | :wavy_dash: `byte cmp` (Unstable with ~425,992,409.9 iters. Increase `minEpochIterations` to e.g. 4259924099)
| 1.62 | 615,800,565.13 | 3.3% | 11.56 | `byte eq`
| 3.09 | 323,477,278.17 | 4.1% | 12.37 | `byte eq (!=)`
| 0.38 | 2,625,428,212.40 | 11.5% | 13.48 | :wavy_dash: `memcmp"792"` (Unstable with ~185,022,838.1 iters. Increase `minEpochIterations` to e.g. 1850228381)
| 0.14 | 7,044,099,699.53 | 2.4% | 12.07 | `byte cmp`
| 1.55 | 644,492,221.03 | 4.2% | 11.92 | `byte eq`
| 3.25 | 307,284,159.28 | 5.1% | 12.55 | :wavy_dash: `byte eq (!=)` (Unstable with ~20,181,097.8 iters. Increase `minEpochIterations` to e.g. 201810978)
| 0.41 | 2,423,039,916.06 | 12.6% | 10.68 | :wavy_dash: `memcmp"808"` (Unstable with ~148,400,143.8 iters. Increase `minEpochIterations` to e.g. 1484001438)
| 0.13 | 7,439,077,323.28 | 5.2% | 11.41 | :wavy_dash: `byte cmp` (Unstable with ~440,406,277.7 iters. Increase `minEpochIterations` to e.g. 4404062777)
| 1.47 | 679,410,718.21 | 2.3% | 12.19 | `byte eq`
| 3.19 | 313,187,424.87 | 3.1% | 11.59 | `byte eq (!=)`
| 0.25 | 4,018,136,019.41 | 7.8% | 11.02 | :wavy_dash: `memcmp"824"` (Unstable with ~211,628,452.5 iters. Increase `minEpochIterations` to e.g. 2116284525)
| 0.14 | 7,111,460,969.13 | 4.8% | 12.19 | `byte cmp`
| 1.59 | 630,283,721.81 | 5.3% | 11.57 | :wavy_dash: `byte eq` (Unstable with ~39,058,481.4 iters. Increase `minEpochIterations` to e.g. 390584814)
| 3.33 | 299,949,334.28 | 5.1% | 12.25 | :wavy_dash: `byte eq (!=)` (Unstable with ~18,885,361.3 iters. Increase `minEpochIterations` to e.g. 188853613)
| 0.36 | 2,803,336,885.56 | 25.9% | 11.92 | :wavy_dash: `memcmp"840"` (Unstable with ~180,300,093.8 iters. Increase `minEpochIterations` to e.g. 1803000938)
| 0.14 | 6,996,681,730.79 | 5.9% | 12.49 | :wavy_dash: `byte cmp` (Unstable with ~459,919,981.3 iters. Increase `minEpochIterations` to e.g. 4599199813)
| 1.53 | 654,836,537.05 | 1.5% | 11.50 | `byte eq`
| 3.39 | 295,043,164.02 | 4.5% | 11.93 | `byte eq (!=)`
| 0.31 | 3,259,429,282.05 | 13.2% | 13.59 | :wavy_dash: `memcmp"856"` (Unstable with ~224,366,947.4 iters. Increase `minEpochIterations` to e.g. 2243669474)
| 0.14 | 7,104,796,392.30 | 4.7% | 12.17 | `byte cmp`
| 1.77 | 565,800,112.16 | 6.4% | 12.12 | :wavy_dash: `byte eq` (Unstable with ~36,897,443.1 iters. Increase `minEpochIterations` to e.g. 368974431)
| 3.61 | 276,797,331.77 | 4.4% | 11.81 | `byte eq (!=)`
| 0.40 | 2,504,048,075.04 | 19.5% | 9.59 | :wavy_dash: `memcmp"872"` (Unstable with ~148,387,400.9 iters. Increase `minEpochIterations` to e.g. 1483874009)
| 0.14 | 7,091,960,812.14 | 5.2% | 12.11 | :wavy_dash: `byte cmp` (Unstable with ~465,386,757.6 iters. Increase `minEpochIterations` to e.g. 4653867576)
| 1.83 | 546,072,906.57 | 9.6% | 12.01 | :wavy_dash: `byte eq` (Unstable with ~35,498,032.8 iters. Increase `minEpochIterations` to e.g. 354980328)
| 3.38 | 295,535,640.53 | 1.9% | 12.22 | `byte eq (!=)`
| 0.34 | 2,949,457,194.35 | 20.4% | 12.53 | :wavy_dash: `memcmp"888"` (Unstable with ~192,783,280.8 iters. Increase `minEpochIterations` to e.g. 1927832808)
| 0.13 | 7,736,131,345.29 | 1.1% | 11.92 | `byte cmp`
| 1.82 | 550,101,421.18 | 7.2% | 11.98 | :wavy_dash: `byte eq` (Unstable with ~35,286,561.2 iters. Increase `minEpochIterations` to e.g. 352865612)
| 3.83 | 260,872,480.57 | 4.5% | 12.66 | `byte eq (!=)`
| 0.27 | 3,704,277,952.52 | 12.2% | 11.34 | :wavy_dash: `memcmp"904"` (Unstable with ~210,707,813.0 iters. Increase `minEpochIterations` to e.g. 2107078130)
| 0.13 | 7,882,976,476.65 | 0.6% | 3,990.83 | `byte cmp`
| 1.62 | 617,608,457.31 | 1.0% | 11.38 | `byte eq`
| 3.41 | 292,913,836.01 | 0.8% | 12.03 | `byte eq (!=)`
| 0.23 | 4,313,453,808.40 | 7.4% | 12.68 | :wavy_dash: `memcmp"920"` (Unstable with ~272,445,153.7 iters. Increase `minEpochIterations` to e.g. 2724451537)
| 0.13 | 7,998,913,841.28 | 0.9% | 12.12 | `byte cmp`
| 1.65 | 606,878,601.12 | 1.2% | 12.23 | `byte eq`
| 3.48 | 287,416,826.92 | 0.2% | 12.11 | `byte eq (!=)`
| 0.22 | 4,643,875,700.16 | 1.2% | 13.27 | `memcmp"936"`
| 0.13 | 7,781,368,280.48 | 0.8% | 12.48 | `byte cmp`
| 1.77 | 564,776,460.45 | 6.3% | 12.30 | :wavy_dash: `byte eq` (Unstable with ~36,268,082.5 iters. Increase `minEpochIterations` to e.g. 362680825)
| 4.15 | 240,734,431.53 | 4.6% | 12.24 | `byte eq (!=)`
| 0.34 | 2,954,823,406.26 | 24.4% | 11.85 | :wavy_dash: `memcmp"952"` (Unstable with ~173,544,559.7 iters. Increase `minEpochIterations` to e.g. 1735445597)
| 0.14 | 6,954,763,685.34 | 5.8% | 11.45 | :wavy_dash: `byte cmp` (Unstable with ~434,073,454.7 iters. Increase `minEpochIterations` to e.g. 4340734547)
| 1.69 | 590,107,252.94 | 0.4% | 12.21 | `byte eq`
| 3.83 | 261,033,487.81 | 3.4% | 12.05 | `byte eq (!=)`
| 0.40 | 2,523,010,029.26 | 10.6% | 13.38 | :wavy_dash: `memcmp"968"` (Unstable with ~188,730,288.7 iters. Increase `minEpochIterations` to e.g. 1887302887)
| 0.13 | 7,616,698,096.12 | 2.8% | 12.70 | `byte cmp`
| 1.74 | 574,524,200.08 | 1.9% | 12.34 | `byte eq`
| 3.95 | 252,865,733.73 | 6.7% | 11.36 | :wavy_dash: `byte eq (!=)` (Unstable with ~15,078,502.7 iters. Increase `minEpochIterations` to e.g. 150785027)
| 0.41 | 2,468,357,871.14 | 15.4% | 11.71 | :wavy_dash: `memcmp"984"` (Unstable with ~159,771,451.7 iters. Increase `minEpochIterations` to e.g. 1597714517)
| 0.15 | 6,501,286,204.18 | 6.1% | 12.32 | :wavy_dash: `byte cmp` (Unstable with ~439,719,433.3 iters. Increase `minEpochIterations` to e.g. 4397194333)
| 1.96 | 511,090,584.00 | 10.3% | 11.38 | :wavy_dash: `byte eq` (Unstable with ~30,655,060.4 iters. Increase `minEpochIterations` to e.g. 306550604)
| 3.76 | 265,827,011.36 | 0.8% | 11.36 | `byte eq (!=)`
| 0.44 | 2,260,189,071.75 | 9.4% | 12.72 | :wavy_dash: `memcmp"1000"` (Unstable with ~173,717,033.4 iters. Increase `minEpochIterations` to e.g. 1737170334)
| 0.15 | 6,783,556,597.24 | 2.8% | 12.68 | `byte cmp`
| 1.99 | 502,051,770.59 | 5.2% | 11.74 | :wavy_dash: `byte eq` (Unstable with ~31,274,577.5 iters. Increase `minEpochIterations` to e.g. 312745775)
| 4.14 | 241,335,244.68 | 3.9% | 11.81 | `byte eq (!=)`
| 0.29 | 3,416,809,155.48 | 14.4% | 10.44 | :wavy_dash: `memcmp"1016"` (Unstable with ~180,149,894.6 iters. Increase `minEpochIterations` to e.g. 1801498946)
| 0.13 | 7,681,387,527.51 | 1.4% | 12.20 | `byte cmp`
| 2.01 | 497,417,007.29 | 11.1% | 12.18 | :wavy_dash: `byte eq` (Unstable with ~31,158,231.8 iters. Increase `minEpochIterations` to e.g. 311582318)
| 4.36 | 229,466,813.08 | 6.7% | 11.91 | :wavy_dash: `byte eq (!=)` (Unstable with ~14,681,869.8 iters. Increase `minEpochIterations` to e.g. 146818698)
And now let's apply our "fix". As we increase the size of the buffers we're dealing with, we also increase the number of bytes the two buffers share. This will give us a rough sense of what the throughput is like when the buffers are actually similar.
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.29 | 3,476,693,776.57 | 5.6% | 11.77 | :wavy_dash: `memcmp"25"` (Unstable with ~213,432,958.0 iters. Increase `minEpochIterations` to e.g. 2134329580)
| 0.20 | 4,899,506,724.63 | 6.7% | 11.44 | :wavy_dash: `byte cmp` (Unstable with ~298,555,468.1 iters. Increase `minEpochIterations` to e.g. 2985554681)
| 0.30 | 3,334,061,778.95 | 0.8% | 11.89 | `byte eq`
| 0.36 | 2,805,280,837.10 | 4.6% | 12.24 | `byte eq (!=)`
| 0.36 | 2,784,761,521.00 | 4.3% | 12.34 | `memcmp"41"`
| 0.32 | 3,140,275,135.79 | 5.2% | 11.93 | :wavy_dash: `byte cmp` (Unstable with ~200,018,357.5 iters. Increase `minEpochIterations` to e.g. 2000183575)
| 0.40 | 2,521,699,035.04 | 2.3% | 11.78 | `byte eq`
| 0.44 | 2,261,137,325.90 | 3.8% | 11.90 | `byte eq (!=)`
| 0.38 | 2,631,745,445.56 | 4.9% | 11.83 | `memcmp"57"`
| 0.41 | 2,421,149,295.82 | 3.6% | 11.51 | `byte cmp`
| 0.40 | 2,480,112,617.08 | 4.4% | 11.88 | `byte eq`
| 0.54 | 1,868,362,408.37 | 3.4% | 12.03 | `byte eq (!=)`
| 0.43 | 2,305,553,846.12 | 6.7% | 12.72 | :wavy_dash: `memcmp"73"` (Unstable with ~151,652,284.5 iters. Increase `minEpochIterations` to e.g. 1516522845)
| 0.43 | 2,339,957,592.74 | 3.4% | 11.94 | `byte cmp`
| 0.45 | 2,241,144,858.78 | 6.1% | 12.04 | :wavy_dash: `byte eq` (Unstable with ~142,388,420.0 iters. Increase `minEpochIterations` to e.g. 1423884200)
| 0.60 | 1,679,738,186.37 | 5.7% | 11.48 | :wavy_dash: `byte eq (!=)` (Unstable with ~104,703,400.5 iters. Increase `minEpochIterations` to e.g. 1047034005)
| 0.56 | 1,789,077,144.43 | 3.9% | 12.30 | `memcmp"89"`
| 0.61 | 1,651,820,183.17 | 5.1% | 12.15 | :wavy_dash: `byte cmp` (Unstable with ~108,743,188.5 iters. Increase `minEpochIterations` to e.g. 1087431885)
| 0.49 | 2,060,460,521.65 | 3.1% | 11.99 | `byte eq`
| 0.66 | 1,510,655,702.41 | 4.6% | 12.05 | `byte eq (!=)`
| 0.54 | 1,853,409,332.82 | 8.6% | 11.78 | :wavy_dash: `memcmp"105"` (Unstable with ~115,812,833.9 iters. Increase `minEpochIterations` to e.g. 1158128339)
| 0.59 | 1,690,734,332.44 | 5.3% | 11.54 | :wavy_dash: `byte cmp` (Unstable with ~104,191,142.2 iters. Increase `minEpochIterations` to e.g. 1041911422)
| 0.47 | 2,147,333,340.41 | 3.8% | 11.85 | `byte eq`
| 0.68 | 1,475,796,810.90 | 7.7% | 11.78 | :wavy_dash: `byte eq (!=)` (Unstable with ~89,756,506.8 iters. Increase `minEpochIterations` to e.g. 897565068)
| 0.57 | 1,764,199,265.09 | 5.9% | 11.66 | :wavy_dash: `memcmp"121"` (Unstable with ~108,652,541.0 iters. Increase `minEpochIterations` to e.g. 1086525410)
| 0.66 | 1,522,418,590.93 | 6.9% | 11.73 | :wavy_dash: `byte cmp` (Unstable with ~93,982,124.0 iters. Increase `minEpochIterations` to e.g. 939821240)
| 0.47 | 2,110,501,813.29 | 7.6% | 12.32 | :wavy_dash: `byte eq` (Unstable with ~137,513,715.2 iters. Increase `minEpochIterations` to e.g. 1375137152)
| 0.75 | 1,332,201,565.09 | 4.8% | 12.34 | `byte eq (!=)`
| 0.73 | 1,377,267,152.61 | 8.6% | 11.75 | :wavy_dash: `memcmp"137"` (Unstable with ~85,781,541.1 iters. Increase `minEpochIterations` to e.g. 857815411)
| 0.76 | 1,310,045,467.36 | 3.7% | 11.66 | `byte cmp`
| 0.54 | 1,848,486,014.26 | 2.4% | 11.79 | `byte eq`
| 0.81 | 1,230,537,669.77 | 6.0% | 11.71 | :wavy_dash: `byte eq (!=)` (Unstable with ~77,720,689.4 iters. Increase `minEpochIterations` to e.g. 777206894)
| 0.65 | 1,537,214,684.22 | 7.5% | 11.72 | :wavy_dash: `memcmp"153"` (Unstable with ~96,209,264.7 iters. Increase `minEpochIterations` to e.g. 962092647)
| 0.92 | 1,085,080,579.26 | 5.8% | 11.62 | :wavy_dash: `byte cmp` (Unstable with ~66,690,020.5 iters. Increase `minEpochIterations` to e.g. 666900205)
| 0.51 | 1,969,787,764.25 | 3.2% | 11.56 | `byte eq`
| 0.90 | 1,115,973,823.73 | 4.3% | 11.94 | `byte eq (!=)`
| 0.73 | 1,360,700,450.95 | 4.7% | 11.71 | `memcmp"169"`
| 0.87 | 1,143,408,882.67 | 3.2% | 11.75 | `byte cmp`
| 0.57 | 1,758,070,977.27 | 3.5% | 11.86 | `byte eq`
| 0.97 | 1,026,202,513.24 | 4.8% | 12.05 | `byte eq (!=)`
| 0.77 | 1,291,224,949.13 | 3.7% | 11.59 | `memcmp"185"`
| 0.99 | 1,013,288,104.84 | 6.3% | 12.12 | :wavy_dash: `byte cmp` (Unstable with ~64,462,135.4 iters. Increase `minEpochIterations` to e.g. 644621354)
| 0.57 | 1,744,325,739.56 | 5.0% | 11.34 | `byte eq`
| 0.95 | 1,049,049,957.32 | 4.8% | 12.45 | `byte eq (!=)`
| 0.83 | 1,210,982,394.31 | 4.5% | 11.50 | `memcmp"201"`
| 1.00 | 997,955,965.58 | 3.3% | 12.01 | `byte cmp`
| 0.61 | 1,651,852,693.98 | 2.2% | 12.05 | `byte eq`
| 1.05 | 948,902,033.71 | 2.7% | 12.43 | `byte eq (!=)`
| 0.79 | 1,265,398,205.13 | 3.5% | 12.03 | `memcmp"217"`
| 1.11 | 901,692,438.40 | 4.2% | 12.20 | `byte cmp`
| 0.62 | 1,621,013,632.14 | 7.9% | 11.41 | :wavy_dash: `byte eq` (Unstable with ~99,331,740.7 iters. Increase `minEpochIterations` to e.g. 993317407)
| 1.07 | 931,239,113.99 | 4.7% | 11.65 | `byte eq (!=)`
| 0.84 | 1,190,750,543.62 | 5.2% | 11.46 | :wavy_dash: `memcmp"233"` (Unstable with ~71,658,337.2 iters. Increase `minEpochIterations` to e.g. 716583372)
| 1.14 | 880,848,219.04 | 5.1% | 11.67 | :wavy_dash: `byte cmp` (Unstable with ~53,388,510.0 iters. Increase `minEpochIterations` to e.g. 533885100)
| 0.58 | 1,722,235,824.50 | 0.4% | 11.65 | `byte eq`
| 1.27 | 784,940,867.62 | 7.4% | 12.33 | :wavy_dash: `byte eq (!=)` (Unstable with ~51,942,189.6 iters. Increase `minEpochIterations` to e.g. 519421896)
| 0.89 | 1,126,969,066.39 | 5.1% | 11.76 | :wavy_dash: `memcmp"249"` (Unstable with ~71,472,396.8 iters. Increase `minEpochIterations` to e.g. 714723968)
| 1.31 | 765,328,199.77 | 2.0% | 12.28 | `byte cmp`
| 0.63 | 1,577,116,532.60 | 4.8% | 12.39 | `byte eq`
| 1.28 | 779,507,438.92 | 8.7% | 11.52 | :wavy_dash: `byte eq (!=)` (Unstable with ~47,872,477.2 iters. Increase `minEpochIterations` to e.g. 478724772)
| 0.96 | 1,036,750,596.69 | 5.3% | 12.12 | :wavy_dash: `memcmp"265"` (Unstable with ~66,169,226.7 iters. Increase `minEpochIterations` to e.g. 661692267)
| 1.74 | 575,626,712.81 | 3.9% | 12.20 | `byte cmp`
| 0.69 | 1,457,639,696.23 | 4.8% | 11.71 | `byte eq`
| 1.30 | 767,600,426.91 | 7.9% | 12.47 | :wavy_dash: `byte eq (!=)` (Unstable with ~49,507,004.3 iters. Increase `minEpochIterations` to e.g. 495070043)
| 0.95 | 1,048,585,934.06 | 6.1% | 12.18 | :wavy_dash: `memcmp"281"` (Unstable with ~68,405,153.8 iters. Increase `minEpochIterations` to e.g. 684051538)
| 1.40 | 712,163,376.62 | 6.5% | 12.17 | :wavy_dash: `byte cmp` (Unstable with ~46,346,318.7 iters. Increase `minEpochIterations` to e.g. 463463187)
| 0.76 | 1,323,936,889.96 | 4.6% | 11.65 | `byte eq`
| 1.48 | 673,805,466.99 | 3.2% | 12.25 | `byte eq (!=)`
| 1.04 | 962,324,952.99 | 2.6% | 11.61 | `memcmp"297"`
| 1.35 | 739,856,280.67 | 0.8% | 11.69 | `byte cmp`
| 0.75 | 1,338,715,506.22 | 5.2% | 11.97 | :wavy_dash: `byte eq` (Unstable with ~84,431,648.6 iters. Increase `minEpochIterations` to e.g. 844316486)
| 1.48 | 674,586,946.43 | 5.2% | 11.91 | :wavy_dash: `byte eq (!=)` (Unstable with ~42,575,952.3 iters. Increase `minEpochIterations` to e.g. 425759523)
| 0.95 | 1,055,111,550.60 | 3.0% | 11.51 | `memcmp"313"`
| 1.67 | 599,304,201.14 | 3.1% | 12.27 | `byte cmp`
| 0.71 | 1,410,156,368.53 | 4.0% | 12.14 | `byte eq`
| 1.38 | 724,928,189.06 | 0.5% | 11.63 | `byte eq (!=)`
| 1.12 | 896,428,560.82 | 6.3% | 12.56 | :wavy_dash: `memcmp"329"` (Unstable with ~60,565,090.8 iters. Increase `minEpochIterations` to e.g. 605650908)
| 1.64 | 610,675,722.51 | 2.8% | 12.20 | `byte cmp`
| 0.82 | 1,223,422,338.58 | 6.4% | 12.46 | :wavy_dash: `byte eq` (Unstable with ~81,548,917.6 iters. Increase `minEpochIterations` to e.g. 815489176)
| 1.55 | 644,713,636.21 | 2.5% | 12.33 | `byte eq (!=)`
| 1.16 | 864,120,325.76 | 10.7% | 12.26 | :wavy_dash: `memcmp"345"` (Unstable with ~56,113,896.8 iters. Increase `minEpochIterations` to e.g. 561138968)
| 1.90 | 526,270,011.28 | 2.0% | 11.43 | `byte cmp`
| 0.81 | 1,237,165,165.43 | 3.9% | 12.09 | `byte eq`
| 1.63 | 613,681,055.06 | 3.7% | 11.99 | `byte eq (!=)`
| 1.19 | 839,520,589.64 | 10.6% | 11.63 | :wavy_dash: `memcmp"361"` (Unstable with ~53,984,899.3 iters. Increase `minEpochIterations` to e.g. 539848993)
| 2.14 | 467,584,051.37 | 3.0% | 12.01 | `byte cmp`
| 0.82 | 1,216,917,306.23 | 5.5% | 11.73 | :wavy_dash: `byte eq` (Unstable with ~76,822,843.8 iters. Increase `minEpochIterations` to e.g. 768228438)
| 1.71 | 584,157,265.89 | 6.4% | 12.28 | :wavy_dash: `byte eq (!=)` (Unstable with ~38,851,223.5 iters. Increase `minEpochIterations` to e.g. 388512235)
| 1.19 | 841,152,252.06 | 8.5% | 11.86 | :wavy_dash: `memcmp"377"` (Unstable with ~52,424,259.3 iters. Increase `minEpochIterations` to e.g. 524242593)
| 2.01 | 498,130,855.49 | 4.6% | 11.60 | `byte cmp`
| 0.81 | 1,240,206,811.34 | 3.7% | 12.01 | `byte eq`
| 1.73 | 576,687,260.49 | 7.2% | 12.49 | :wavy_dash: `byte eq (!=)` (Unstable with ~38,434,332.2 iters. Increase `minEpochIterations` to e.g. 384343322)
| 1.26 | 793,259,865.29 | 5.1% | 11.96 | :wavy_dash: `memcmp"393"` (Unstable with ~51,653,241.0 iters. Increase `minEpochIterations` to e.g. 516532410)
| 2.03 | 492,108,494.43 | 3.4% | 11.93 | `byte cmp`
| 0.91 | 1,104,029,941.44 | 3.6% | 11.99 | `byte eq`
| 1.78 | 561,004,123.93 | 5.9% | 11.59 | :wavy_dash: `byte eq (!=)` (Unstable with ~33,375,024.4 iters. Increase `minEpochIterations` to e.g. 333750244)
| 1.22 | 820,435,928.47 | 6.5% | 12.52 | :wavy_dash: `memcmp"409"` (Unstable with ~54,816,259.1 iters. Increase `minEpochIterations` to e.g. 548162591)
| 2.28 | 438,325,794.93 | 5.8% | 12.73 | :wavy_dash: `byte cmp` (Unstable with ~28,737,964.1 iters. Increase `minEpochIterations` to e.g. 287379641)
| 0.96 | 1,044,744,110.77 | 8.3% | 12.25 | :wavy_dash: `byte eq` (Unstable with ~68,326,982.2 iters. Increase `minEpochIterations` to e.g. 683269822)
| 2.21 | 451,939,191.71 | 7.2% | 12.07 | :wavy_dash: `byte eq (!=)` (Unstable with ~29,206,734.3 iters. Increase `minEpochIterations` to e.g. 292067343)
| 1.43 | 700,384,271.92 | 3.0% | 12.34 | `memcmp"425"`
| 2.55 | 392,081,625.29 | 5.2% | 11.87 | :wavy_dash: `byte cmp` (Unstable with ~24,886,907.7 iters. Increase `minEpochIterations` to e.g. 248869077)
| 1.01 | 985,224,730.43 | 5.6% | 12.48 | :wavy_dash: `byte eq` (Unstable with ~65,121,514.2 iters. Increase `minEpochIterations` to e.g. 651215142)
| 2.23 | 447,486,383.51 | 2.8% | 12.67 | `byte eq (!=)`
| 1.52 | 657,661,635.94 | 3.8% | 11.78 | `memcmp"441"`
| 2.54 | 394,400,459.55 | 2.6% | 11.87 | `byte cmp`
| 0.96 | 1,044,927,042.13 | 5.2% | 12.44 | :wavy_dash: `byte eq` (Unstable with ~68,552,464.5 iters. Increase `minEpochIterations` to e.g. 685524645)
| 2.14 | 467,783,151.30 | 5.6% | 11.65 | :wavy_dash: `byte eq (!=)` (Unstable with ~28,204,577.0 iters. Increase `minEpochIterations` to e.g. 282045770)
| 1.47 | 678,817,730.14 | 2.8% | 11.86 | `memcmp"457"`
| 2.57 | 389,471,376.74 | 3.9% | 12.24 | `byte cmp`
| 1.00 | 998,469,441.10 | 4.7% | 12.06 | `byte eq`
| 2.14 | 468,317,343.47 | 6.2% | 11.94 | :wavy_dash: `byte eq (!=)` (Unstable with ~29,239,951.2 iters. Increase `minEpochIterations` to e.g. 292399512)
| 1.38 | 726,469,584.71 | 3.5% | 12.19 | `memcmp"473"`
| 2.57 | 389,420,056.06 | 3.1% | 12.21 | `byte cmp`
| 0.92 | 1,083,360,084.14 | 4.8% | 11.57 | `byte eq`
| 2.26 | 441,777,119.86 | 5.9% | 12.46 | :wavy_dash: `byte eq (!=)` (Unstable with ~29,677,028.8 iters. Increase `minEpochIterations` to e.g. 296770288)
| 1.41 | 706,945,915.17 | 6.5% | 11.95 | :wavy_dash: `memcmp"489"` (Unstable with ~44,173,442.7 iters. Increase `minEpochIterations` to e.g. 441734427)
| 2.35 | 425,127,680.25 | 4.1% | 11.94 | `byte cmp`
| 1.02 | 979,266,753.78 | 7.3% | 11.56 | :wavy_dash: `byte eq` (Unstable with ~60,509,944.5 iters. Increase `minEpochIterations` to e.g. 605099445)
| 2.10 | 476,270,519.02 | 4.0% | 12.06 | `byte eq (!=)`
| 1.48 | 677,394,078.52 | 7.1% | 11.63 | :wavy_dash: `memcmp"505"` (Unstable with ~42,530,766.0 iters. Increase `minEpochIterations` to e.g. 425307660)
| 2.68 | 372,931,310.73 | 7.9% | 11.78 | :wavy_dash: `byte cmp` (Unstable with ~23,936,733.7 iters. Increase `minEpochIterations` to e.g. 239367337)
| 1.04 | 958,380,803.71 | 3.9% | 11.99 | `byte eq`
| 2.08 | 480,379,098.49 | 0.7% | 12.26 | `byte eq (!=)`
| 1.50 | 667,655,903.19 | 3.2% | 12.00 | `memcmp"521"`
| 2.65 | 377,845,149.67 | 2.7% | 11.92 | `byte cmp`
| 1.07 | 938,663,476.67 | 4.2% | 12.14 | `byte eq`
| 2.35 | 424,998,949.80 | 5.3% | 11.80 | :wavy_dash: `byte eq (!=)` (Unstable with ~26,314,205.9 iters. Increase `minEpochIterations` to e.g. 263142059)
| 1.50 | 666,195,155.30 | 6.3% | 11.97 | :wavy_dash: `memcmp"537"` (Unstable with ~42,480,849.4 iters. Increase `minEpochIterations` to e.g. 424808494)
| 2.79 | 358,049,600.55 | 4.9% | 12.08 | `byte cmp`
| 1.03 | 973,141,345.64 | 6.8% | 12.13 | :wavy_dash: `byte eq` (Unstable with ~62,050,106.9 iters. Increase `minEpochIterations` to e.g. 620501069)
| 2.54 | 394,074,027.97 | 7.5% | 11.58 | :wavy_dash: `byte eq (!=)` (Unstable with ~24,470,980.1 iters. Increase `minEpochIterations` to e.g. 244709801)
| 1.54 | 650,481,110.99 | 6.1% | 12.26 | :wavy_dash: `memcmp"553"` (Unstable with ~42,154,272.5 iters. Increase `minEpochIterations` to e.g. 421542725)
| 2.70 | 370,639,347.18 | 5.7% | 11.83 | :wavy_dash: `byte cmp` (Unstable with ~23,349,148.8 iters. Increase `minEpochIterations` to e.g. 233491488)
| 1.13 | 886,430,044.88 | 6.5% | 12.24 | :wavy_dash: `byte eq` (Unstable with ~57,444,883.3 iters. Increase `minEpochIterations` to e.g. 574448833)
| 2.61 | 383,798,428.53 | 2.3% | 12.09 | `byte eq (!=)`
| 1.67 | 597,409,644.56 | 4.1% | 12.18 | `memcmp"569"`
| 2.75 | 363,133,118.25 | 4.4% | 12.06 | `byte cmp`
| 1.09 | 919,230,493.68 | 4.9% | 12.20 | `byte eq`
| 2.52 | 397,399,729.86 | 4.0% | 12.20 | `byte eq (!=)`
| 1.62 | 616,135,196.15 | 7.3% | 11.89 | :wavy_dash: `memcmp"585"` (Unstable with ~39,549,500.5 iters. Increase `minEpochIterations` to e.g. 395495005)
| 2.98 | 336,017,957.14 | 3.0% | 12.73 | `byte cmp`
| 1.14 | 876,059,170.73 | 3.2% | 11.62 | `byte eq`
| 2.71 | 368,720,517.79 | 2.6% | 12.41 | `byte eq (!=)`
| 1.69 | 592,264,156.04 | 3.9% | 11.91 | `memcmp"601"`
| 2.81 | 355,263,599.66 | 4.7% | 12.16 | `byte cmp`
| 1.15 | 867,680,088.19 | 5.4% | 12.03 | :wavy_dash: `byte eq` (Unstable with ~56,356,047.3 iters. Increase `minEpochIterations` to e.g. 563560473)
| 2.77 | 360,463,303.20 | 2.5% | 12.20 | `byte eq (!=)`
| 1.80 | 555,712,902.45 | 5.8% | 11.70 | :wavy_dash: `memcmp"617"` (Unstable with ~35,583,179.9 iters. Increase `minEpochIterations` to e.g. 355831799)
| 2.87 | 348,836,388.64 | 2.5% | 12.06 | `byte cmp`
| 1.22 | 817,453,807.94 | 6.3% | 11.86 | :wavy_dash: `byte eq` (Unstable with ~52,565,446.4 iters. Increase `minEpochIterations` to e.g. 525654464)
| 2.49 | 401,238,698.08 | 0.5% | 11.42 | `byte eq (!=)`
| 1.72 | 581,214,095.22 | 6.5% | 12.15 | :wavy_dash: `memcmp"633"` (Unstable with ~37,128,788.6 iters. Increase `minEpochIterations` to e.g. 371287886)
| 3.19 | 313,764,196.76 | 5.0% | 12.15 | `byte cmp`
| 1.19 | 837,237,331.62 | 7.4% | 11.72 | :wavy_dash: `byte eq` (Unstable with ~51,629,349.6 iters. Increase `minEpochIterations` to e.g. 516293496)
| 2.81 | 356,253,939.74 | 8.1% | 11.89 | :wavy_dash: `byte eq (!=)` (Unstable with ~22,692,445.5 iters. Increase `minEpochIterations` to e.g. 226924455)
| 1.80 | 556,489,949.29 | 4.5% | 12.28 | `memcmp"649"`
| 3.12 | 320,376,946.77 | 7.4% | 11.60 | :wavy_dash: `byte cmp` (Unstable with ~19,825,373.7 iters. Increase `minEpochIterations` to e.g. 198253737)
| 1.25 | 799,057,362.43 | 7.0% | 11.29 | :wavy_dash: `byte eq` (Unstable with ~46,598,186.7 iters. Increase `minEpochIterations` to e.g. 465981867)
| 2.91 | 343,121,450.07 | 3.0% | 11.86 | `byte eq (!=)`
| 1.84 | 543,256,714.99 | 4.9% | 11.73 | `memcmp"665"`
| 3.10 | 322,404,240.97 | 7.6% | 10.54 | :wavy_dash: `byte cmp` (Unstable with ~17,512,182.7 iters. Increase `minEpochIterations` to e.g. 175121827)
| 1.26 | 791,586,045.03 | 5.3% | 12.55 | :wavy_dash: `byte eq` (Unstable with ~52,740,407.5 iters. Increase `minEpochIterations` to e.g. 527404075)
| 2.97 | 336,412,772.13 | 3.3% | 12.52 | `byte eq (!=)`
| 1.85 | 541,464,452.35 | 6.2% | 12.00 | :wavy_dash: `memcmp"681"` (Unstable with ~34,806,137.2 iters. Increase `minEpochIterations` to e.g. 348061372)
| 2.94 | 340,648,365.38 | 2.9% | 12.47 | `byte cmp`
| 1.28 | 783,281,011.84 | 6.4% | 12.23 | :wavy_dash: `byte eq` (Unstable with ~51,963,253.6 iters. Increase `minEpochIterations` to e.g. 519632536)
| 3.04 | 328,723,745.49 | 6.8% | 12.26 | :wavy_dash: `byte eq (!=)` (Unstable with ~21,532,684.5 iters. Increase `minEpochIterations` to e.g. 215326845)
| 1.88 | 532,425,699.62 | 5.8% | 11.90 | :wavy_dash: `memcmp"697"` (Unstable with ~33,990,134.6 iters. Increase `minEpochIterations` to e.g. 339901346)
| 3.43 | 291,784,551.45 | 5.8% | 12.26 | :wavy_dash: `byte cmp` (Unstable with ~19,227,599.7 iters. Increase `minEpochIterations` to e.g. 192275997)
| 1.32 | 758,014,327.08 | 4.2% | 11.61 | `byte eq`
| 3.00 | 333,761,148.36 | 6.9% | 11.99 | :wavy_dash: `byte eq (!=)` (Unstable with ~20,935,421.6 iters. Increase `minEpochIterations` to e.g. 209354216)
| 2.01 | 497,656,359.56 | 9.4% | 11.71 | :wavy_dash: `memcmp"713"` (Unstable with ~31,506,369.9 iters. Increase `minEpochIterations` to e.g. 315063699)
| 3.25 | 307,312,953.06 | 4.4% | 12.08 | `byte cmp`
| 1.26 | 794,794,457.04 | 3.4% | 12.07 | `byte eq`
| 3.18 | 314,865,385.81 | 3.2% | 11.67 | `byte eq (!=)`
| 2.03 | 492,639,690.89 | 6.0% | 11.80 | :wavy_dash: `memcmp"729"` (Unstable with ~31,291,224.4 iters. Increase `minEpochIterations` to e.g. 312912244)
| 3.28 | 305,289,215.08 | 5.3% | 11.77 | :wavy_dash: `byte cmp` (Unstable with ~18,684,811.4 iters. Increase `minEpochIterations` to e.g. 186848114)
| 1.30 | 766,844,126.80 | 7.2% | 12.37 | :wavy_dash: `byte eq` (Unstable with ~50,063,013.7 iters. Increase `minEpochIterations` to e.g. 500630137)
| 3.28 | 305,329,718.81 | 2.2% | 11.95 | `byte eq (!=)`
| 2.08 | 481,176,461.09 | 5.4% | 11.46 | :wavy_dash: `memcmp"745"` (Unstable with ~28,881,479.4 iters. Increase `minEpochIterations` to e.g. 288814794)
| 3.57 | 280,050,667.43 | 4.2% | 12.40 | `byte cmp`
| 1.40 | 714,028,723.03 | 3.8% | 12.47 | `byte eq`
| 3.30 | 302,882,097.87 | 5.4% | 11.77 | :wavy_dash: `byte eq (!=)` (Unstable with ~19,304,122.0 iters. Increase `minEpochIterations` to e.g. 193041220)
| 2.16 | 462,184,876.92 | 3.8% | 12.11 | `memcmp"761"`
| 3.50 | 285,625,739.73 | 8.0% | 11.86 | :wavy_dash: `byte cmp` (Unstable with ~18,148,865.3 iters. Increase `minEpochIterations` to e.g. 181488653)
| 1.40 | 714,094,218.32 | 10.9% | 11.95 | :wavy_dash: `byte eq` (Unstable with ~45,025,620.1 iters. Increase `minEpochIterations` to e.g. 450256201)
| 3.11 | 321,359,658.74 | 3.6% | 12.07 | `byte eq (!=)`
| 1.92 | 522,104,284.49 | 1.5% | 11.51 | `memcmp"777"`
| 3.33 | 300,562,064.96 | 5.5% | 12.39 | :wavy_dash: `byte cmp` (Unstable with ~19,521,311.0 iters. Increase `minEpochIterations` to e.g. 195213110)
| 1.30 | 767,025,310.63 | 0.6% | 11.63 | `byte eq`
| 3.58 | 279,668,348.08 | 4.3% | 12.04 | `byte eq (!=)`
| 2.07 | 483,738,651.31 | 1.6% | 12.17 | `memcmp"793"`
| 3.64 | 274,838,914.87 | 5.1% | 11.91 | :wavy_dash: `byte cmp` (Unstable with ~17,387,173.4 iters. Increase `minEpochIterations` to e.g. 173871734)
| 1.41 | 707,374,598.42 | 4.8% | 12.07 | `byte eq`
| 3.44 | 290,472,756.43 | 4.7% | 12.22 | `byte eq (!=)`
| 2.14 | 467,929,051.90 | 7.2% | 11.54 | :wavy_dash: `memcmp"809"` (Unstable with ~28,423,808.6 iters. Increase `minEpochIterations` to e.g. 284238086)
| 3.55 | 281,663,425.20 | 6.8% | 11.95 | :wavy_dash: `byte cmp` (Unstable with ~17,926,491.2 iters. Increase `minEpochIterations` to e.g. 179264912)
| 1.40 | 712,948,929.47 | 5.7% | 11.44 | :wavy_dash: `byte eq` (Unstable with ~43,331,223.5 iters. Increase `minEpochIterations` to e.g. 433312235)
| 3.71 | 269,302,794.98 | 6.8% | 11.16 | :wavy_dash: `byte eq (!=)` (Unstable with ~16,169,133.8 iters. Increase `minEpochIterations` to e.g. 161691338)
| 2.26 | 442,043,910.37 | 9.9% | 12.27 | :wavy_dash: `memcmp"825"` (Unstable with ~28,957,553.8 iters. Increase `minEpochIterations` to e.g. 289575538)
| 3.66 | 273,156,155.59 | 6.1% | 12.27 | :wavy_dash: `byte cmp` (Unstable with ~17,495,327.8 iters. Increase `minEpochIterations` to e.g. 174953278)
| 1.49 | 669,010,768.57 | 7.9% | 11.72 | :wavy_dash: `byte eq` (Unstable with ~41,997,048.8 iters. Increase `minEpochIterations` to e.g. 419970488)
| 3.31 | 302,530,968.23 | 2.5% | 12.18 | `byte eq (!=)`
| 2.19 | 455,767,740.30 | 6.7% | 12.21 | :wavy_dash: `memcmp"841"` (Unstable with ~29,470,363.5 iters. Increase `minEpochIterations` to e.g. 294703635)
| 3.69 | 271,091,551.98 | 3.3% | 12.31 | `byte cmp`
| 1.56 | 641,386,064.95 | 4.2% | 12.16 | `byte eq`
| 3.94 | 253,730,127.09 | 7.0% | 11.12 | :wavy_dash: `byte eq (!=)` (Unstable with ~15,402,423.9 iters. Increase `minEpochIterations` to e.g. 154024239)
| 2.26 | 441,825,173.05 | 5.9% | 12.31 | :wavy_dash: `memcmp"857"` (Unstable with ~28,692,245.5 iters. Increase `minEpochIterations` to e.g. 286922455)
| 3.97 | 252,061,866.06 | 4.0% | 11.47 | `byte cmp`
| 1.52 | 657,259,219.34 | 8.5% | 12.10 | :wavy_dash: `byte eq` (Unstable with ~42,387,575.3 iters. Increase `minEpochIterations` to e.g. 423875753)
| 3.53 | 283,093,390.06 | 5.7% | 11.60 | :wavy_dash: `byte eq (!=)` (Unstable with ~17,163,787.3 iters. Increase `minEpochIterations` to e.g. 171637873)
| 2.22 | 451,124,768.02 | 6.4% | 11.98 | :wavy_dash: `memcmp"873"` (Unstable with ~27,815,493.8 iters. Increase `minEpochIterations` to e.g. 278154938)
| 3.65 | 274,207,224.43 | 2.6% | 11.95 | `byte cmp`
| 1.61 | 620,545,357.44 | 3.1% | 11.85 | `byte eq`
| 4.10 | 244,188,297.70 | 3.2% | 12.55 | `byte eq (!=)`
| 2.29 | 435,840,792.12 | 5.4% | 12.32 | :wavy_dash: `memcmp"889"` (Unstable with ~27,417,301.0 iters. Increase `minEpochIterations` to e.g. 274173010)
| 3.88 | 258,049,426.53 | 7.7% | 11.07 | :wavy_dash: `byte cmp` (Unstable with ~15,279,146.2 iters. Increase `minEpochIterations` to e.g. 152791462)
| 1.73 | 577,087,198.91 | 2.7% | 12.36 | `byte eq`
| 3.85 | 259,738,146.72 | 2.8% | 11.93 | `byte eq (!=)`
| 2.45 | 408,531,058.07 | 5.2% | 12.28 | :wavy_dash: `memcmp"905"` (Unstable with ~26,681,002.2 iters. Increase `minEpochIterations` to e.g. 266810022)
| 3.95 | 253,171,707.54 | 3.8% | 11.37 | `byte cmp`
| 1.59 | 627,119,034.45 | 8.8% | 11.68 | :wavy_dash: `byte eq` (Unstable with ~38,753,346.6 iters. Increase `minEpochIterations` to e.g. 387533466)
| 3.75 | 266,385,014.10 | 6.4% | 11.05 | :wavy_dash: `byte eq (!=)` (Unstable with ~15,196,559.2 iters. Increase `minEpochIterations` to e.g. 151965592)
| 2.28 | 437,831,511.95 | 4.6% | 12.09 | `memcmp"921"`
| 3.93 | 254,363,029.98 | 3.6% | 11.19 | `byte cmp`
| 1.73 | 579,644,407.27 | 4.4% | 12.07 | `byte eq`
| 4.05 | 247,097,612.87 | 4.4% | 11.17 | `byte eq (!=)`
| 2.66 | 375,663,155.56 | 4.7% | 11.86 | `memcmp"937"`
| 3.92 | 254,987,751.81 | 5.7% | 11.19 | :wavy_dash: `byte cmp` (Unstable with ~14,866,583.7 iters. Increase `minEpochIterations` to e.g. 148665837)
| 1.64 | 608,454,210.82 | 5.4% | 11.87 | :wavy_dash: `byte eq` (Unstable with ~37,300,866.5 iters. Increase `minEpochIterations` to e.g. 373008665)
| 3.89 | 256,810,834.80 | 7.0% | 11.06 | :wavy_dash: `byte eq (!=)` (Unstable with ~14,754,290.1 iters. Increase `minEpochIterations` to e.g. 147542901)
| 2.26 | 442,422,415.64 | 0.5% | 11.50 | `memcmp"953"`
| 4.08 | 244,951,836.43 | 4.5% | 11.61 | `byte cmp`
| 1.66 | 604,123,296.32 | 7.0% | 11.96 | :wavy_dash: `byte eq` (Unstable with ~38,174,710.9 iters. Increase `minEpochIterations` to e.g. 381747109)
| 4.27 | 234,112,811.86 | 3.1% | 12.67 | `byte eq (!=)`
| 2.77 | 361,352,758.28 | 4.3% | 12.13 | `memcmp"969"`
| 3.99 | 250,853,287.74 | 4.2% | 12.17 | `byte cmp`
| 1.70 | 589,727,848.49 | 7.3% | 11.66 | :wavy_dash: `byte eq` (Unstable with ~36,112,283.2 iters. Increase `minEpochIterations` to e.g. 361122832)
| 3.99 | 250,402,390.59 | 3.7% | 11.77 | `byte eq (!=)`
| 2.32 | 431,793,040.59 | 0.0% | 11.81 | `memcmp"985"`
| 4.26 | 235,014,476.20 | 4.5% | 11.28 | `byte cmp`
| 1.57 | 635,989,693.44 | 2.2% | 11.74 | `byte eq`
| 4.32 | 231,421,505.63 | 4.9% | 11.93 | `byte eq (!=)`
| 2.45 | 408,019,544.60 | 3.8% | 12.04 | `memcmp"1001"`
| 4.07 | 245,534,786.65 | 6.0% | 11.75 | :wavy_dash: `byte cmp` (Unstable with ~15,372,727.6 iters. Increase `minEpochIterations` to e.g. 153727276)
| 1.72 | 581,532,335.48 | 6.2% | 11.86 | :wavy_dash: `byte eq` (Unstable with ~36,503,161.2 iters. Increase `minEpochIterations` to e.g. 365031612)
| 4.22 | 236,768,344.14 | 5.1% | 11.28 | :wavy_dash: `byte eq (!=)` (Unstable with ~14,260,624.3 iters. Increase `minEpochIterations` to e.g. 142606243)
| 2.62 | 381,649,881.73 | 6.1% | 12.31 | :wavy_dash: `memcmp"1017"` (Unstable with ~25,051,112.9 iters. Increase `minEpochIterations` to e.g. 250511129)
| 4.55 | 219,571,423.96 | 6.1% | 11.85 | :wavy_dash: `byte cmp` (Unstable with ~14,183,872.2 iters. Increase `minEpochIterations` to e.g. 141838722)
| 1.73 | 577,101,234.43 | 7.4% | 11.66 | :wavy_dash: `byte eq` (Unstable with ~35,800,854.3 iters. Increase `minEpochIterations` to e.g. 358008543)
| 4.42 | 226,116,576.86 | 5.7% | 11.90 | :wavy_dash: `byte eq (!=)` (Unstable with ~14,237,177.3 iters. Increase `minEpochIterations` to e.g. 142371773)
And now we can see that our branchless optimization was in fact an optimization...although it really needed to be operating on larger buffers. For these tests I added an extra 16 random bytes at the end (so the buffers are still not likely to ever match), but now we can see the throughput. At this point I might be tempted to tune this function and really put the squeeze on memcmp in all cases.
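To make the input shape concrete, here is a minimal sketch of how a pair of buffers like the ones described above could be built: a shared prefix followed by 16 independently random bytes in each buffer. The helper names `make_similar_buffers` and `common_prefix`, the fixed seed, and the choice of `std::mt19937` are illustrative assumptions, not the setup code actually used for these benchmarks.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <utility>

// Hypothetical helper: returns two buffers of shared + tail bytes.
// The first `shared` bytes are identical in both; the last `tail`
// bytes are drawn independently, so a full match is very unlikely.
inline std::pair<std::string, std::string>
make_similar_buffers(std::size_t shared, std::size_t tail = 16, unsigned seed = 1234) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);

    std::string a(shared + tail, '\0');
    for (std::size_t i = 0; i < shared; ++i)
        a[i] = static_cast<char>(dist(rng));

    std::string b = a; // b starts as a copy, so the prefix is shared

    for (std::size_t i = shared; i < shared + tail; ++i) {
        a[i] = static_cast<char>(dist(rng));
        b[i] = static_cast<char>(dist(rng));
    }
    return {std::move(a), std::move(b)};
}

// Length of the common prefix, handy for sanity-checking the inputs.
inline std::size_t common_prefix(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t i = 0;
    while (i < a.size() && a[i] == b[i]) ++i;
    return i;
}
```

Calling `make_similar_buffers(64)` would give two 80 byte buffers that agree on at least their first 64 bytes, which is the kind of input where a comparison has to walk most of the buffer before it can report a difference.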
But I know one last thing, which is the disassembly I was trying to mock up with my example. If we take a peek at memcmp's disassembly we'll find that memcmp is also tuned a bit. In MSVC at least, memcmp checks the size of the buffer and does something different when the buffer is large. I have to assume clang is doing something similar. In short, in order to beat memcmp we'll have to absorb the cost of an additional branch and then optimize two different cases (one for larger buffers and one for smaller ones). This also explains why we won out with our quick and dirty example when the data was completely random: we were avoiding the cost of a branch instruction / misprediction on the buffer's size. We can also see here how small a window we had: a single branch misprediction was likely the entire difference, and it only held out for maybe 73 - 105 matching bytes.
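As a rough illustration of "one extra branch, two tuned cases", here is a sketch of an equality check that branches once on the size and then picks a branchy loop for small buffers or a branchless one for large buffers. This is not what MSVC's or clang's memcmp actually does internally; the function names and the `kLargeCutoff` value of 96 are made-up placeholders (loosely picked from the 73 - 105 byte window above) that would need tuning per machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Load 8 bytes via memcpy so alignment and aliasing are not a concern.
inline std::uint64_t load_u64(const char* p) noexcept {
    std::uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Small buffers: branch on every word so a mismatch exits right away.
inline bool eq_branchy(const char* a, const char* b, std::size_t n) noexcept {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        if (load_u64(a + i) != load_u64(b + i)) return false;
    for (; i < n; ++i)
        if (a[i] != b[i]) return false;
    return true;
}

// Large buffers: OR all differences together and test once at the end.
inline bool eq_branchless(const char* a, const char* b, std::size_t n) noexcept {
    std::uint64_t diff = 0;
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        diff |= load_u64(a + i) ^ load_u64(b + i);
    for (; i < n; ++i)
        diff |= static_cast<unsigned char>(a[i] ^ b[i]);
    return diff == 0;
}

// Assumed cutoff, purely illustrative; measure before trusting it.
constexpr std::size_t kLargeCutoff = 96;

// The single extra branch on size that picks one of the two strategies.
inline bool eq_dispatch(const char* a, const char* b, std::size_t n) noexcept {
    assert(a != nullptr && b != nullptr);
    return n < kLargeCutoff ? eq_branchy(a, b, n) : eq_branchless(a, b, n);
}
```

The size check in `eq_dispatch` is exactly the cost the quick and dirty version was dodging, so whether this wins depends on how predictable the sizes are in the real workload.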
Obviously, though, this probably invokes a few bits of advice we've all likely heard over and over: don't reinvent the wheel, test and benchmark your code, know your hardware (I'm working with an i7 6800k (x64)), and probably a lot more. I would say, though, that at least at an individual level you shouldn't just trust your library writers' general code to fit your specific use case. Here, for example, I know what I was after (I definitely would've liked to have definitively destroyed memcmp in a benchmark with something so simple), but I also know my inputs. Although my example wasn't a decisive winner (and it definitely fell off with larger inputs), I know that it would be for what I'm dealing with.
But is this truly the best we can do? Are there no tricks left? Of course there are. As we can see here, we've learned a key detail: we're paying a hefty price for going branchless...but so far we've gone all in. A branch, after all, is a chance to exit from the loop. This means that a branchless approach heavily favors inputs which match, where a branchy approach heavily favors ones which don't. Obviously it entirely depends on what you're dealing with, but a fair guess is that, of all the input combinations, two inputs which match are less likely than two which don't. Suppose we compromise: do chunks of data branchlessly, and branch less often?
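#include <cstddef>
#include <cstdint>
#include <type_traits>
//branchless with a 4x unrolled main loop: returns 0 when the buffers match, non-zero otherwise
constexpr auto byte_eq4(const char* pattern, const char* input, size_t sz) noexcept {
using register_type = std::conditional_t<sizeof(void*) == 8, uint64_t, uint32_t>;
//div and rem can be computed in one op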
const size_t loop_count = sz / (sizeof(register_type));
const size_t loop_rem = sz % (sizeof(register_type));
//this performs oddly
register_type ret = 0;
size_t i = 0;
for (; (i+3) < loop_count; i+=4) {
ret |= ((const register_type*)pattern)[i] ^ ((const register_type*)input)[i];
ret |= ((const register_type*)pattern)[i+1] ^ ((const register_type*)input)[i+1];
ret |= ((const register_type*)pattern)[i+2] ^ ((const register_type*)input)[i+2];
ret |= ((const register_type*)pattern)[i+3] ^ ((const register_type*)input)[i+3];
}
for (; i < loop_count; i++) {
ret |= ((const register_type*)pattern)[i] ^ ((const register_type*)input)[i];
}
pattern += (loop_count * sizeof(register_type));
input += (loop_count * sizeof(register_type));
//pattern and input were already advanced above, so the tail is indexed from 0
for (size_t j = 0; j < loop_rem; j++) {
ret |= ((const char*)pattern)[j] ^ ((const char*)input)[j];
}
return ret;
}
//same accumulation as byte_eq4, but branches once per unrolled group of four words to exit early
constexpr auto byte_eq5(const char* pattern, const char* input, size_t sz) noexcept {
using register_type = std::conditional_t<sizeof(void*) == 8, uint64_t, uint32_t>;
//div rem can be one op
const size_t loop_count = sz / (sizeof(register_type));
const size_t loop_rem = sz % (sizeof(register_type));
//this performs oddly
register_type ret = 0;
size_t i = 0;
for (; (i + 3) < loop_count; i += 4) {
ret |= ((const register_type*)pattern)[i] ^ ((const register_type*)input)[i];
ret |= ((const register_type*)pattern)[i + 1] ^ ((const register_type*)input)[i + 1];
ret |= ((const register_type*)pattern)[i + 2] ^ ((const register_type*)input)[i + 2];
ret |= ((const register_type*)pattern)[i + 3] ^ ((const register_type*)input)[i + 3];
if (ret != 0)
return ret;
}
for (; i < loop_count; i++) {
ret |= ((const register_type*)pattern)[i] ^ ((const register_type*)input)[i];
}
if (ret != 0)
return ret;
pattern += (loop_count * sizeof(register_type));
input += (loop_count * sizeof(register_type));
for (size_t i = 0; i < loop_rem; i++) {
ret |= ((const char*)pattern)[i] ^ ((const char*)input)[i];
}
return ret;
}
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.34 | 2,934,870,975.51 | 4.8% | 11.98 | `memcmp"31"`
| 0.20 | 4,888,021,084.78 | 7.4% | 12.20 | :wavy_dash: `byte cmp` (Unstable with ~316,051,392.3 iters. Increase `minEpochIterations` to e.g. 3160513923)
| 0.51 | 1,976,468,440.40 | 3.4% | 12.10 | `byte eq`
| 0.58 | 1,723,198,756.28 | 5.7% | 11.94 | :wavy_dash: `byte eq (!=)` (Unstable with ~107,431,129.8 iters. Increase `minEpochIterations` to e.g. 1074311298)
| 1.89 | 527,855,641.79 | 5.3% | 12.12 | :wavy_dash: `byte eq (unroll)` (Unstable with ~34,047,694.6 iters. Increase `minEpochIterations` to e.g. 340476946)
| 0.19 | 5,261,142,160.85 | 8.1% | 12.38 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~341,912,621.1 iters. Increase `minEpochIterations` to e.g. 3419126211)
| 0.37 | 2,727,117,782.49 | 8.1% | 12.73 | :wavy_dash: `memcmp"47"` (Unstable with ~181,456,680.2 iters. Increase `minEpochIterations` to e.g. 1814566802)
| 0.31 | 3,239,085,348.29 | 2.7% | 12.11 | `byte cmp`
| 0.83 | 1,198,993,949.19 | 3.9% | 11.66 | `byte eq`
| 0.61 | 1,649,871,834.21 | 1.2% | 11.95 | `byte eq (!=)`
| 2.27 | 439,992,000.30 | 1.8% | 11.88 | `byte eq (unroll)`
| 0.29 | 3,483,814,908.88 | 2.6% | 11.68 | `byte eq (unroll branch)`
| 0.38 | 2,609,665,348.19 | 1.3% | 12.13 | `memcmp"63"`
| 0.33 | 3,057,531,504.46 | 0.6% | 12.15 | `byte cmp`
| 0.57 | 1,747,149,157.70 | 1.4% | 11.89 | `byte eq`
| 0.69 | 1,447,469,653.36 | 5.8% | 12.73 | :wavy_dash: `byte eq (!=)` (Unstable with ~93,130,180.5 iters. Increase `minEpochIterations` to e.g. 931301805)
| 3.14 | 318,632,217.36 | 5.1% | 11.22 | :wavy_dash: `byte eq (unroll)` (Unstable with ~19,368,564.1 iters. Increase `minEpochIterations` to e.g. 193685641)
| 0.29 | 3,399,375,405.44 | 9.2% | 11.83 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~214,605,850.0 iters. Increase `minEpochIterations` to e.g. 2146058500)
| 0.46 | 2,174,807,354.85 | 6.8% | 12.32 | :wavy_dash: `memcmp"79"` (Unstable with ~142,016,614.0 iters. Increase `minEpochIterations` to e.g. 1420166140)
| 0.43 | 2,312,404,624.04 | 3.5% | 11.00 | `byte cmp`
| 0.68 | 1,477,749,527.25 | 5.7% | 11.25 | :wavy_dash: `byte eq` (Unstable with ~88,952,135.5 iters. Increase `minEpochIterations` to e.g. 889521355)
| 0.93 | 1,080,467,638.84 | 7.1% | 11.70 | :wavy_dash: `byte eq (!=)` (Unstable with ~67,567,254.9 iters. Increase `minEpochIterations` to e.g. 675672549)
| 3.97 | 251,951,203.85 | 5.9% | 12.43 | :wavy_dash: `byte eq (unroll)` (Unstable with ~16,798,835.5 iters. Increase `minEpochIterations` to e.g. 167988355)
| 0.45 | 2,235,516,447.92 | 11.5% | 12.20 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~146,274,909.9 iters. Increase `minEpochIterations` to e.g. 1462749099)
| 0.50 | 1,993,476,501.61 | 4.2% | 11.99 | `memcmp"95"`
| 0.45 | 2,236,046,246.61 | 9.2% | 12.24 | :wavy_dash: `byte cmp` (Unstable with ~139,306,958.2 iters. Increase `minEpochIterations` to e.g. 1393069582)
| 0.68 | 1,475,459,541.96 | 3.4% | 12.30 | `byte eq`
| 0.96 | 1,046,039,764.57 | 3.7% | 12.19 | `byte eq (!=)`
| 4.27 | 233,965,678.34 | 4.6% | 12.21 | `byte eq (unroll)`
| 0.41 | 2,431,612,333.48 | 4.4% | 11.41 | `byte eq (unroll branch)`
| 0.56 | 1,771,493,746.27 | 3.7% | 12.11 | `memcmp"111"`
| 0.51 | 1,941,766,995.59 | 3.9% | 11.59 | `byte cmp`
| 0.69 | 1,441,740,022.14 | 3.4% | 12.69 | `byte eq`
| 0.97 | 1,027,104,436.55 | 5.9% | 12.17 | :wavy_dash: `byte eq (!=)` (Unstable with ~66,054,897.6 iters. Increase `minEpochIterations` to e.g. 660548976)
| 5.01 | 199,614,764.64 | 4.7% | 11.71 | `byte eq (unroll)`
| 0.54 | 1,855,900,913.62 | 4.9% | 11.87 | `byte eq (unroll branch)`
| 0.65 | 1,536,066,820.00 | 7.8% | 12.08 | :wavy_dash: `memcmp"127"` (Unstable with ~97,850,640.9 iters. Increase `minEpochIterations` to e.g. 978506409)
| 0.56 | 1,788,930,511.46 | 5.3% | 12.22 | :wavy_dash: `byte cmp` (Unstable with ~114,511,193.2 iters. Increase `minEpochIterations` to e.g. 1145111932)
| 0.72 | 1,379,365,801.51 | 3.7% | 12.05 | `byte eq`
| 1.04 | 959,945,097.61 | 4.0% | 12.12 | `byte eq (!=)`
| 5.70 | 175,555,618.85 | 4.0% | 11.51 | `byte eq (unroll)`
| 0.49 | 2,033,352,502.24 | 4.4% | 11.92 | `byte eq (unroll branch)`
| 0.67 | 1,497,982,821.20 | 3.8% | 11.73 | `memcmp"143"`
| 0.60 | 1,657,426,805.89 | 7.8% | 11.98 | :wavy_dash: `byte cmp` (Unstable with ~105,054,034.4 iters. Increase `minEpochIterations` to e.g. 1050540344)
| 0.80 | 1,244,370,392.81 | 5.9% | 12.68 | :wavy_dash: `byte eq` (Unstable with ~84,544,000.8 iters. Increase `minEpochIterations` to e.g. 845440008)
| 1.10 | 906,805,942.16 | 4.8% | 12.00 | `byte eq (!=)`
| 6.07 | 164,799,878.83 | 3.5% | 12.09 | `byte eq (unroll)`
| 0.66 | 1,524,836,223.99 | 8.6% | 11.99 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~98,337,616.8 iters. Increase `minEpochIterations` to e.g. 983376168)
Ok, now we're definitely beating memcmp...what else can we do? How about SSE?
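#include <emmintrin.h> //SSE2 intrinsics
//returns 0 when the buffers match, non-zero otherwise; branches once per 16 byte block
int byte_eq_sse_negated(const char* pattern, const char* input, size_t sz) noexcept {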
int32_t ret = 0;
const size_t loop_count = sz / (sizeof(__m128i));
const size_t loop_rem = sz % (sizeof(__m128i));
for (size_t i = 0; i < loop_count; i++) {
__m128i lhs = _mm_loadu_si128(((const __m128i*)pattern) + i);
__m128i rhs = _mm_loadu_si128(((const __m128i*)input) + i);
__m128i tmp = _mm_cmpeq_epi8(lhs, rhs);
ret |= ((~_mm_movemask_epi8(tmp)) & 0xffff);
if (ret != 0) {
return ret;
}
}
pattern += (loop_count * sizeof(__m128i));
input += (loop_count * sizeof(__m128i));
for (size_t i = 0; i < loop_rem; i++) {
ret |= ((const char*)pattern)[i] ^ ((const char*)input)[i];
}
return ret;
}
| ns/op | op/s | err% | total | memcmp benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 0.27 | 3,639,340,670.25 | 8.7% | 12.02 | :wavy_dash: `memcmp"20"` (Unstable with ~225,904,652.9 iters. Increase `minEpochIterations` to e.g. 2259046529)
| 0.19 | 5,187,746,931.10 | 5.0% | 11.85 | :wavy_dash: `byte cmp` (Unstable with ~325,383,142.7 iters. Increase `minEpochIterations` to e.g. 3253831427)
| 0.42 | 2,402,869,648.28 | 4.8% | 11.43 | `byte eq`
| 1.49 | 672,321,405.42 | 2.9% | 12.24 | `byte eq (unroll)`
| 0.19 | 5,230,157,716.42 | 7.0% | 11.79 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~318,460,252.5 iters. Increase `minEpochIterations` to e.g. 3184602525)
| 0.30 | 3,352,545,401.62 | 7.0% | 12.76 | :wavy_dash: `byte eq (sse)` (Unstable with ~227,442,129.9 iters. Increase `minEpochIterations` to e.g. 2274421299)
| 0.30 | 3,292,934,370.24 | 6.3% | 12.37 | :wavy_dash: `byte eq (sse negated)` (Unstable with ~215,028,432.0 iters. Increase `minEpochIterations` to e.g. 2150284320)
| 0.38 | 2,664,842,656.94 | 4.8% | 12.23 | `memcmp"36"`
| 0.30 | 3,360,321,590.28 | 6.6% | 11.93 | :wavy_dash: `byte cmp` (Unstable with ~216,265,698.2 iters. Increase `minEpochIterations` to e.g. 2162656982)
| 0.46 | 2,152,682,255.41 | 4.0% | 11.80 | `byte eq`
| 2.18 | 458,413,508.18 | 7.5% | 11.64 | :wavy_dash: `byte eq (unroll)` (Unstable with ~28,080,114.0 iters. Increase `minEpochIterations` to e.g. 280801140)
| 0.32 | 3,150,386,319.97 | 5.4% | 11.80 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~197,097,241.8 iters. Increase `minEpochIterations` to e.g. 1970972418)
| 0.35 | 2,858,841,054.18 | 5.4% | 11.89 | :wavy_dash: `byte eq (sse)` (Unstable with ~179,117,231.8 iters. Increase `minEpochIterations` to e.g. 1791172318)
| 0.34 | 2,933,341,057.30 | 3.7% | 12.17 | `byte eq (sse negated)`
| 0.39 | 2,596,104,253.92 | 5.5% | 10.91 | :wavy_dash: `memcmp"52"` (Unstable with ~145,133,815.8 iters. Increase `minEpochIterations` to e.g. 1451338158)
| 0.36 | 2,784,535,353.22 | 7.8% | 11.93 | :wavy_dash: `byte cmp` (Unstable with ~175,649,868.6 iters. Increase `minEpochIterations` to e.g. 1756498686)
| 0.49 | 2,042,163,365.75 | 9.0% | 11.87 | :wavy_dash: `byte eq` (Unstable with ~128,471,168.1 iters. Increase `minEpochIterations` to e.g. 1284711681)
| 2.81 | 355,409,341.75 | 10.8% | 12.85 | :wavy_dash: `byte eq (unroll)` (Unstable with ~23,691,396.9 iters. Increase `minEpochIterations` to e.g. 236913969)
| 0.30 | 3,339,568,111.69 | 6.4% | 11.84 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~211,939,981.5 iters. Increase `minEpochIterations` to e.g. 2119399815)
| 0.39 | 2,585,576,883.82 | 4.6% | 11.33 | `byte eq (sse)`
| 0.42 | 2,405,745,381.77 | 8.7% | 12.48 | :wavy_dash: `byte eq (sse negated)` (Unstable with ~160,505,323.5 iters. Increase `minEpochIterations` to e.g. 1605053235)
| 0.49 | 2,039,744,333.37 | 6.8% | 12.15 | :wavy_dash: `memcmp"68"` (Unstable with ~134,187,383.2 iters. Increase `minEpochIterations` to e.g. 1341873832)
| 0.45 | 2,199,607,667.63 | 5.6% | 11.75 | :wavy_dash: `byte cmp` (Unstable with ~138,510,164.9 iters. Increase `minEpochIterations` to e.g. 1385101649)
| 0.51 | 1,970,186,281.98 | 7.2% | 11.34 | :wavy_dash: `byte eq` (Unstable with ~115,349,393.1 iters. Increase `minEpochIterations` to e.g. 1153493931)
| 3.35 | 298,291,485.00 | 2.1% | 11.99 | `byte eq (unroll)`
| 0.45 | 2,239,646,875.89 | 3.9% | 11.62 | `byte eq (unroll branch)`
| 0.42 | 2,358,219,882.01 | 4.6% | 11.47 | `byte eq (sse)`
| 0.45 | 2,227,821,467.54 | 3.5% | 11.48 | `byte eq (sse negated)`
| 0.47 | 2,130,793,178.80 | 6.2% | 12.06 | :wavy_dash: `memcmp"84"` (Unstable with ~135,413,562.9 iters. Increase `minEpochIterations` to e.g. 1354135629)
| 0.49 | 2,028,145,930.98 | 7.7% | 11.47 | :wavy_dash: `byte cmp` (Unstable with ~123,654,400.0 iters. Increase `minEpochIterations` to e.g. 1236544000)
| 0.46 | 2,197,228,675.53 | 1.8% | 11.17 | `byte eq`
| 3.35 | 298,879,526.89 | 0.9% | 11.71 | `byte eq (unroll)`
| 0.33 | 2,993,380,841.07 | 0.7% | 12.05 | `byte eq (unroll branch)`
| 0.45 | 2,233,079,396.92 | 0.2% | 11.74 | `byte eq (sse)`
| 0.44 | 2,290,852,513.09 | 0.7% | 11.68 | `byte eq (sse negated)`
| 0.46 | 2,191,021,089.28 | 2.1% | 11.84 | `memcmp"100"`
| 0.52 | 1,930,693,355.65 | 1.2% | 11.67 | `byte cmp`
| 0.56 | 1,781,280,821.22 | 4.5% | 11.97 | `byte eq`
| 4.40 | 227,220,556.09 | 5.0% | 11.75 | :wavy_dash: `byte eq (unroll)` (Unstable with ~14,610,565.0 iters. Increase `minEpochIterations` to e.g. 146105650)
| 0.49 | 2,031,545,197.07 | 3.2% | 11.66 | `byte eq (unroll branch)`
| 0.60 | 1,662,959,104.85 | 3.2% | 11.86 | `byte eq (sse)`
| 0.57 | 1,753,521,377.71 | 2.8% | 12.27 | `byte eq (sse negated)`
| 0.54 | 1,854,203,036.44 | 3.2% | 11.96 | `memcmp"116"`
| 0.64 | 1,552,203,704.78 | 3.7% | 12.02 | `byte cmp`
| 0.56 | 1,781,938,373.32 | 4.6% | 11.76 | `byte eq`
| 5.34 | 187,352,287.08 | 2.7% | 11.81 | `byte eq (unroll)`
| 0.50 | 2,000,580,958.83 | 5.0% | 11.62 | :wavy_dash: `byte eq (unroll branch)` (Unstable with ~123,540,971.0 iters. Increase `minEpochIterations` to e.g. 1235409710)
| 0.57 | 1,746,202,968.95 | 2.4% | 12.03 | `byte eq (sse)`
| 0.58 | 1,714,887,198.58 | 2.6% | 11.79 | `byte eq (sse negated)`
| 0.60 | 1,663,864,716.74 | 2.4% | 11.89 | `memcmp"132"`
| 0.76 | 1,317,747,603.11 | 3.6% | 11.39 | `byte cmp`
| 0.61 | 1,650,014,026.02 | 6.4% | 12.19 | :wavy_dash: `byte eq` (Unstable with ~107,921,348.2 iters. Increase `minEpochIterations` to e.g. 1079213482)
| 5.66 | 176,788,069.69 | 3.7% | 11.72 | `byte eq (unroll)`
| 0.60 | 1,662,745,534.62 | 2.9% | 11.83 | `byte eq (unroll branch)`
| 0.66 | 1,524,072,200.48 | 2.3% | 12.11 | `byte eq (sse)`
| 0.60 | 1,666,577,950.69 | 5.7% | 12.38 | :wavy_dash: `byte eq (sse negated)` (Unstable with ~106,968,760.7 iters. Increase `minEpochIterations` to e.g. 1069687607)
| 0.59 | 1,692,483,586.30 | 6.8% | 11.99 | :wavy_dash: `memcmp"148"` (Unstable with ~107,205,549.6 iters. Increase `minEpochIterations` to e.g. 1072055496)
| 0.84 | 1,196,049,989.25 | 3.1% | 11.78 | `byte cmp`
| 0.59 | 1,692,535,174.86 | 7.3% | 12.41 | :wavy_dash: `byte eq` (Unstable with ~109,257,093.5 iters. Increase `minEpochIterations` to e.g. 1092570935)
| 5.99 | 167,072,990.93 | 5.9% | 11.93 | :wavy_dash: `byte eq (unroll)` (Unstable with ~10,411,640.7 iters. Increase `minEpochIterations` to e.g. 104116407)
| 0.61 | 1,651,621,364.54 | 4.5% | 11.81 | `byte eq (unroll branch)`
| 0.65 | 1,536,109,057.69 | 6.5% | 12.39 | :wavy_dash: `byte eq (sse)` (Unstable with ~102,740,948.7 iters. Increase `minEpochIterations` to e.g. 1027409487)
| 0.64 | 1,570,911,526.41 | 2.3% | 12.18 | `byte eq (sse negated)`
Ok...also not bad, but not exactly much better than what we had.
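One thing the SSE version above does not do is the chunking compromise from earlier: it branches after every 16 byte block. A minimal sketch of combining the two ideas follows, ANDing four compare masks together and branching once per 64 bytes. The name byte_eq_sse_chunked is made up for illustration, it returns a bool rather than the int used above, and it has not been run through the benchmark, so treat it as something to measure rather than a known win.

```cpp
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Returns true when the first sz bytes of pattern and input are equal.
inline bool byte_eq_sse_chunked(const char* pattern, const char* input, std::size_t sz) noexcept {
    const auto load = [](const char* p) noexcept {
        return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p)); // unaligned 16 byte load
    };
    std::size_t i = 0;
    // 64 bytes per iteration: AND four per-byte equality masks, then branch once.
    for (; i + 64 <= sz; i += 64) {
        __m128i eq = _mm_cmpeq_epi8(load(pattern + i), load(input + i));
        eq = _mm_and_si128(eq, _mm_cmpeq_epi8(load(pattern + i + 16), load(input + i + 16)));
        eq = _mm_and_si128(eq, _mm_cmpeq_epi8(load(pattern + i + 32), load(input + i + 32)));
        eq = _mm_and_si128(eq, _mm_cmpeq_epi8(load(pattern + i + 48), load(input + i + 48)));
        if (_mm_movemask_epi8(eq) != 0xFFFF) return false; // some byte differed
    }
    // Leftover whole 16 byte blocks, one branch each.
    for (; i + 16 <= sz; i += 16) {
        const __m128i eq = _mm_cmpeq_epi8(load(pattern + i), load(input + i));
        if (_mm_movemask_epi8(eq) != 0xFFFF) return false;
    }
    // Final 0-15 bytes, accumulated without branching.
    unsigned diff = 0;
    for (; i < sz; i++) {
        diff |= static_cast<unsigned char>(pattern[i] ^ input[i]);
    }
    return diff == 0;
}
```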
I'm going to stop here since I'm happy with my results, granted there's another consideration for memcmp: suppose we know one buffer is a constant? Can we get C++ to generate an unrolled loop of comparisons? Will that beat these tight loops? What if we force the compiler to vectorize the loops?
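The constant-buffer question can at least be sketched, even if it isn't answered here. Assuming C++17, a fold expression over an index sequence expands into one XOR per byte of a string literal with no loop in the source at all. The names eq_literal and eq_literal_impl are hypothetical, and whether the compiler merges those byte operations into wider compares (or whether this beats the loops above) is untested.

```cpp
#include <cstddef>
#include <utility>

// Expands to (lit[0]^input[0]) | (lit[1]^input[1]) | ... with no early exit.
template <std::size_t N, std::size_t... I>
constexpr bool eq_literal_impl(const char (&lit)[N], const char* input,
                               std::index_sequence<I...>) noexcept {
    return ((static_cast<unsigned char>(lit[I] ^ input[I])) | ... | 0u) == 0;
}

// Compares input against a string literal whose length is known at compile time.
// The caller must guarantee input has at least N - 1 readable bytes.
template <std::size_t N>
constexpr bool eq_literal(const char (&lit)[N], const char* input) noexcept {
    // N - 1 skips the literal's trailing '\0'.
    return eq_literal_impl(lit, input, std::make_index_sequence<N - 1>{});
}
```

For example, eq_literal("GET ", request) checks a 4 byte prefix, which is the kind of fixed-length comparison a compiler has a decent chance of folding into a single cmp.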
I should add, if you're staring at byte eq and byte eq (!=), that byte eq (!=) does not become branchless. That's right, you do need to be careful about which operators you're using when trying to write branchless code. In the byte eq (!=) case it's actually roughly equivalent to how byte cmp functions. Roughly speaking, your best bet is to stick to bitwise operations; with logical ones like != the compiler tends to consider the branchy approach to be the faster one.
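To show the distinction side by side, here is a reduced pair of loops. This assumes byte eq (!=) differs from byte eq only in accumulating with != instead of ^; the function names are made up for illustration. Both return 0 on a match, but what the compiler emits for each depends on the compiler and flags, so check the disassembly rather than trusting the source.

```cpp
#include <cstddef>

// Bitwise form: every byte pair is XORed into the accumulator, no comparison in the source.
inline unsigned eq_bitwise(const unsigned char* a, const unsigned char* b, std::size_t sz) noexcept {
    unsigned diff = 0;
    for (std::size_t i = 0; i < sz; i++) {
        diff |= static_cast<unsigned>(a[i] ^ b[i]);
    }
    return diff;
}

// Logical form: same result for equality purposes, but != yields a bool,
// which invites the compiler to test and jump instead of accumulating.
inline unsigned eq_logical(const unsigned char* a, const unsigned char* b, std::size_t sz) noexcept {
    unsigned diff = 0;
    for (std::size_t i = 0; i < sz; i++) {
        diff |= (a[i] != b[i]);
    }
    return diff;
}
```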
Forgive the data dump, I've yet to find a nice way to plot this out in html.